Discussion: LMDB KDB module design notes
Greg Hudson
2018-04-09 14:45:07 UTC
I have been considering how MIT krb5 might implement an LMDB KDB
module.

LMDB operations take place within read or write transactions. Read
transactions do not block write transactions; instead, read transactions
delay the reclamation of pages obsoleted by write transactions. This is
attractive for a KDB, as it means "kdb5_util dump" can take a snapshot
of the database without blocking password changes or administrative
operations. (The DB2 module allows this with the "unlockiter" DB
option, but that option carries a noticeable performance penalty, causes
kdb5_util dump to write something which isn't exactly a snapshot, and is
probably open to rare edge cases where an admin deletes a principal
entry right as it's being iterated through.)

"kdb5_util load" is our one transactional write operation. It calls
krb5_db_create() with the "temporary" DB option, puts principal and
policy entries, and then calls krb5_db_promote() to make the new KDB
visible. The DB2 module handles this by creating side databases and
lockfiles with a "~" extension, and then renaming them into place. For
this to work, each kdb_db2 operation needs to close and reopen the
database.

The three lockout fields of principal entries (last_success,
last_failed, and fail_auth_count) introduce additional complexity. These
fields are updated by the KDC by default, and are not replicated in an
iprop setup. iprop loads include the "merge_nra" DB option when
creating the side database, indicating that existing principal entries
should retain their current lockout attribute values.

Here is my general design framework, taking the above into
consideration:

* We use two MDB environments, setting the MDB_NOSUBDIR flag so that
each environment is a pair of files instead of a subdirectory:

- A primary environment (suffix ".mdb") containing a "policy" database
holding policy entries and a "principal" database holding principal
entries minus lockout fields.

- A secondary environment (suffix ".lockout.mdb") containing a
"lockout" database holding principal lockout fields.

The KDC only needs to write to the lockout environment, and can open
the primary environment read-only.

The lockout environment is never emptied, never iterated over, and
uses only short-lived transactions, so the KDC is never blocked more
than briefly.
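
For concreteness, here is a minimal sketch of how the two environments
might be opened. The paths and helper name are hypothetical; the LMDB
calls and flags are the ones described in this message (MDB_NOTLS and
MDB_NOSYNC are discussed further below):

    #include <lmdb.h>

    /* Open one environment with room for the named databases we use. */
    static int
    open_env(const char *path, unsigned int flags, size_t mapsize,
             MDB_env **env_out)
    {
        MDB_env *env = NULL;
        int rc;

        rc = mdb_env_create(&env);
        if (rc != 0)
            return rc;
        mdb_env_set_maxdbs(env, 2);        /* enough for either environment */
        mdb_env_set_mapsize(env, mapsize);
        rc = mdb_env_open(env, path, MDB_NOSUBDIR | flags, 0600);
        if (rc != 0) {
            mdb_env_close(env);
            return rc;
        }
        *env_out = env;
        return 0;
    }

    /* KDC usage: primary environment read-only, lockout read-write, e.g.:
     *
     *   open_env("/path/to/principal.mdb",
     *            MDB_RDONLY | MDB_NOTLS, primary_mapsize, &primary_env);
     *   open_env("/path/to/principal.lockout.mdb",
     *            MDB_NOSYNC, lockout_mapsize, &lockout_env);
     */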

* For creations with the "temporary" DB option, instead of creating a
side database, we open or create the usual environment files, begin a
write transaction on the primary environment for the lifetime of the
database context, and open and drop the principal and policy databases
within that transaction. put_principal and put_policy operations use
the database context write transaction instead of creating short-lived
ones. When the database is promoted, we commit the write transaction
and the load becomes visible.

To maintain the low-contention nature of the lockout environment, we
compromise on the transactionality of load operations for the lockout
fields. We do not empty the lockout database on a load and we write
entries to it as put_principal operations occur during the load.
Therefore:

- updates to the lockout fields become visible immediately (for
existing principal entries), instead of at the end of the load.

- updates to the lockout fields remain visible (for existing principal
entries) if the load operation is aborted.

- since we don't empty the lockout database, we leave garbage entries
behind for old principals which have disappeared from the dump file
we loaded.

I don't anticipate any of those behaviors being noticeable in
practice. We could provide a tool to remove the garbage entries in
the lockout database if it becomes an issue for anyone.
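
As a rough sketch of the load path under this scheme (the function
names are hypothetical; only the LMDB calls are real), the "temporary"
create and the later promote would look something like:

    #include <lmdb.h>

    /* krb5_db_create() with "temporary": begin a write transaction that
     * lives as long as the DB context, and empty both databases inside
     * it.  Nothing becomes visible to readers until the commit. */
    static int
    begin_temporary_load(MDB_env *env, MDB_txn **txn_out,
                         MDB_dbi *princ_dbi, MDB_dbi *policy_dbi)
    {
        MDB_txn *txn;
        int rc;

        rc = mdb_txn_begin(env, NULL, 0, &txn);
        if (rc != 0)
            return rc;
        rc = mdb_dbi_open(txn, "principal", MDB_CREATE, princ_dbi);
        if (rc == 0)
            rc = mdb_dbi_open(txn, "policy", MDB_CREATE, policy_dbi);
        if (rc == 0)
            rc = mdb_drop(txn, *princ_dbi, 0);  /* 0 = empty, keep the DB */
        if (rc == 0)
            rc = mdb_drop(txn, *policy_dbi, 0);
        if (rc != 0) {
            mdb_txn_abort(txn);
            return rc;
        }
        /* put_principal and put_policy operations reuse this transaction. */
        *txn_out = txn;
        return 0;
    }

    /* krb5_db_promote(): commit, making the whole load visible at once. */
    static int
    promote_load(MDB_txn *txn)
    {
        return mdb_txn_commit(txn);
    }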

* For iprop loads, we set a context flag if we see the "merge_nra" DB
option at creation time. If the context flag is set, put_principal
operations check for existing entries in the lockout database before
writing, and do nothing if an entry is already there.
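
The merge_nra check could then be a single short transaction against
the lockout environment, along these lines (the key encoding and names
are placeholders):

    #include <lmdb.h>

    /* During an iprop load, write lockout fields for a principal only if
     * no entry exists yet, so existing lockout values are retained. */
    static int
    put_lockout_if_absent(MDB_env *lockout_env, MDB_dbi lockout_dbi,
                          MDB_val *key, MDB_val *val)
    {
        MDB_txn *txn;
        MDB_val existing;
        int rc;

        rc = mdb_txn_begin(lockout_env, NULL, 0, &txn);
        if (rc != 0)
            return rc;
        rc = mdb_get(txn, lockout_dbi, key, &existing);
        if (rc == 0) {
            /* Entry already present; keep its current values. */
            mdb_txn_abort(txn);
            return 0;
        }
        if (rc == MDB_NOTFOUND)
            rc = mdb_put(txn, lockout_dbi, key, val, 0);
        if (rc != 0) {
            mdb_txn_abort(txn);
            return rc;
        }
        return mdb_txn_commit(txn);
    }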

* To iterate over principals or policies, we create a read transaction
in the primary MDB environment for the lifetime of the cursor. By
default, LMDB only allows one transaction per environment per thread.
This would break "kdb5_util update_princ_encryption", which does
put_principal operations during iteration. Therefore, we must specify
the MDB_NOTLS flag in the primary environment.

The MDB_NOTLS flag carries a performance penalty for the creation of
read transactions. To mitigate this penalty, we can save a read
transaction handle in the DB context for get operations, using
mdb_txn_reset() and mdb_txn_renew() between operations.
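
A sketch of that cached read transaction, assuming a hypothetical
per-context structure (MDB_NOTLS is what allows this handle to coexist
with an iteration cursor's transaction in the same thread):

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lmdb.h>

    struct ctx {                     /* hypothetical DB context fields */
        MDB_env *env;
        MDB_dbi princ_dbi;
        MDB_txn *read_txn;           /* kept reset between get operations */
    };

    static int
    ctx_get(struct ctx *ctx, MDB_val *key, void **data_out, size_t *len_out)
    {
        MDB_val val;
        int rc;

        if (ctx->read_txn == NULL)
            rc = mdb_txn_begin(ctx->env, NULL, MDB_RDONLY, &ctx->read_txn);
        else
            rc = mdb_txn_renew(ctx->read_txn);
        if (rc != 0)
            return rc;
        rc = mdb_get(ctx->read_txn, ctx->princ_dbi, key, &val);
        if (rc == 0) {
            /* Copy the value out before releasing the snapshot; the mapped
             * pages may be reclaimed by writers after mdb_txn_reset(). */
            *data_out = malloc(val.mv_size);
            if (*data_out == NULL) {
                rc = ENOMEM;
            } else {
                memcpy(*data_out, val.mv_data, val.mv_size);
                *len_out = val.mv_size;
            }
        }
        mdb_txn_reset(ctx->read_txn);
        return rc;
    }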

* The existing in-tree KDB modules allow simultaneous access to the same
DB context by multiple threads, even though the KDC and kadmind are
single-threaded and we don't allow krb5_context objects to be used by
multiple threads simultaneously. For the LMDB module, we will need to
either synchronize the use of transaction handles, or document that it
isn't thread-safe and will need mutexes added if it needs to be
thread-safe in the future.

* LMDB files are capped at the memory map size, which is 10MB by
default. Heimdal exposes this as a configuration option and we should
probably do the same; we might also want a larger default like 128MB.
We will have to consider how to apply any default map size to the
lockout environment as well as the primary environment.

* LMDB also has a configurable maximum number of readers. The default
of 126 is probably adequate for most deployments, but we again
probably want a configuration option in case it needs to be raised.

* By default LMDB calls fsync() or fdatasync() for each committed write
transaction. This probably overshadows the performance benefits of
LMDB versus DB2, in exchange for improved durability. I think we will
want to always set the MDB_NOSYNC flag for the lockout environment,
and might need to add an option to set it for the primary environment.
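
Pulling the last three points together, the lockout environment setup
might end up looking like this variant of the earlier open sketch, with
the map size and reader limit coming from whatever configuration we
decide to expose (names and values here are placeholders):

    #include <lmdb.h>

    static int
    open_lockout_env(const char *path, size_t mapsize,
                     unsigned int maxreaders, MDB_env **env_out)
    {
        MDB_env *env = NULL;
        int rc;

        rc = mdb_env_create(&env);
        if (rc != 0)
            return rc;
        mdb_env_set_maxdbs(env, 1);               /* just "lockout" */
        mdb_env_set_mapsize(env, mapsize);        /* configurable map size */
        mdb_env_set_maxreaders(env, maxreaders);  /* LMDB default is 126 */
        rc = mdb_env_open(env, path, MDB_NOSUBDIR | MDB_NOSYNC, 0600);
        if (rc != 0) {
            mdb_env_close(env);
            return rc;
        }
        *env_out = env;
        return 0;
    }
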
Nathaniel McCallum
2018-04-09 20:40:43 UTC
This seems reasonable. I'm glad to see MIT considering LMDB (my
experiences with it are positive).
Robbie Harwood
2018-04-10 16:47:40 UTC
Post by Greg Hudson
Here is my general design framework, taking the above into
consideration:
* We use two MDB environments, setting the MDB_NOSUBDIR flag so that
each environment is a pair of files instead of a subdirectory:
- A primary environment (suffix ".mdb") containing a "policy"
database holding policy entries and a "principal" database holding
principal entries minus lockout fields.
- A secondary environment (suffix ".lockout.mdb") containing a
"lockout" database holding principal lockout fields.
The KDC only needs to write to the lockout environment, and can open
the primary environment read-only.
The lockout environment is never emptied, never iterated over, and
uses only short-lived transactions, so the KDC is never blocked more
than briefly.
Overall this design seems good to me.

It's hard to tell from the docs - is there a disadvantage to
MDB_NOSUBDIR? It seems weird to have it as an option but not the
default.
Post by Greg Hudson
* For creations with the "temporary" DB option, instead of creating a
side database, we open or create the usual environment files, begin a
write transaction on the primary environment for the lifetime of the
database context, and open and drop the principal and policy databases
within that transaction. put_principal and put_policy operations use
the database context write transaction instead of creating short-lived
ones. When the database is promoted, we commit the write transaction
and the load becomes visible.
To maintain the low-contention nature of the lockout environment, we
compromise on the transactionality of load operations for the
lockout fields. We do not empty the lockout database on a load and
we write entries to it as put_principal operations occur during the
load. Therefore:
- updates to the lockout fields become visible immediately (for
existing principal entries), instead of at the end of the load.
- updates to the lockout fields remain visible (for existing
principal entries) if the load operation is aborted.
- since we don't empty the lockout database, we leave garbage
entries behind for old principals which have disappeared from the
dump file we loaded.
I don't anticipate any of those behaviors being noticeable in
practice. We could provide a tool to remove the garbage entries in
the lockout database if it becomes an issue for anyone.
The size is capped at the number of principals that have ever existed,
right? I'm also not worried about it then.
Post by Greg Hudson
* The existing in-tree KDB modules allow simultaneous access to the same
DB context by multiple threads, even though the KDC and kadmind are
single-threaded and we don't allow krb5_context objects to be used by
multiple threads simultaneously. For the LMDB module, we will need to
either synchronize the use of transaction handles, or document that it
isn't thread-safe and will need mutexes added if it needs to be
thread-safe in the future.
* LMDB files are capped at the memory map size, which is 10MB by
default. Heimdal exposes this as a configuration option and we should
probably do the same; we might also want a larger default like 128MB.
We will have to consider how to apply any default map size to the
lockout environment as well as the primary environment.
What will the failure modes look like on this? Does LMDB return useful
information around the caps?
Post by Greg Hudson
* LMDB also has a configurable maximum number of readers. The default
of 126 is probably adequate for most deployments, but we again
probably want a configuration option in case it needs to be raised.
Agreed.
Post by Greg Hudson
* By default LMDB calls fsync() or fdatasync() for each committed
write transaction. This probably overshadows the performance benefits
of LMDB versus DB2, in exchange for improved durability. I think we
will want to always set the MDB_NOSYNC flag for the lockout
environment, and might need to add an option to set it for the primary
environment.
Agreed. Primary will be needed, even if only for testing.

Thanks,
--Robbie
Greg Hudson
2018-04-15 15:58:05 UTC
I have prototype code for this design which passes the test suite
(temporarily modified to create LMDB KDBs for the Python tests and tests/dejagnu,
and with a few BDB-specific tests skipped). I'm working on polishing it
and adding documentation and proper tests.
Post by Robbie Harwood
It's hard to tell from the docs - is there a disadvantage to
MDB_NOSUBDIR? It seems weird to have it as an option but not the
default.
LMDB uses two files per environment. By default, they have the suffixes
"/data.mdb" and "/lock.mdb"; with MDB_NOSUBDIR, they have the suffixes
"" and "-lock". The default has the advantage that it uses exactly the
directory entry given to it and no others, though it is up to the
consumer to create the directory.
From our perspective, the main drawback of MDB_NOSUBDIR is that our
destroy method needs to just know about the MDB_NOSUBDIR suffixes in
order to clean up the files. If we used the default, we could nuke the
directory (annoyingly hard to do in C) with no special knowledge.
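
As an illustration of that drawback, the destroy method ends up
removing both files of each environment by name, roughly like this
(the "-lock" suffix is the MDB_NOSUBDIR convention; the helper name is
made up):

    #include <stdio.h>
    #include <unistd.h>

    /* Remove the data file and its MDB_NOSUBDIR "-lock" companion; called
     * once for the ".mdb" path and once for the ".lockout.mdb" path. */
    static void
    destroy_env_files(const char *path)
    {
        char lockpath[1024];

        (void)unlink(path);
        if (snprintf(lockpath, sizeof(lockpath), "%s-lock", path) <
            (int)sizeof(lockpath))
            (void)unlink(lockpath);
    }
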
Post by Robbie Harwood
Post by Greg Hudson
* LMDB files are capped at the memory map size, which is 10MB by
default. Heimdal exposes this as a configuration option and we should
probably do the same; we might also want a larger default like 128MB.
We will have to consider how to apply any default map size to the
lockout environment as well as the primary environment.
What will the failure modes look like on this? Does LMDB return useful
information around the caps?
With my prototype code, an admin would see something like:

add_principal: LMDB write failure (path: /me/krb5/build/testdir/db.mdb):
MDB_MAP_FULL: Environment mapsize limit reached while creating
"***@KRBTEST.COM".

where the "MDB_MAP_FULL...reached" part comes from mdb_strerror(). We
could intercept MDB_MAP_FULL and say something else there.

I measured that each principal entry takes about 430 bytes in the main
environment (with the default of AES-128 and AES-256 keys, and a name
length of about 22 bytes) and about 100 bytes in the lockout
environment. With these lengths, a 128MB map size for the main
environment would accommodate around 300K principal entries. The LMDB
default of 10MB would accommodate around 25K entries.
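(As a rough check, ignoring per-page and B-tree overhead: 128 * 2^20 /
430 is about 312,000 entries, and 10 * 2^20 / 430 is about 24,000.)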
Post by Robbie Harwood
Post by Greg Hudson
* By default LMDB calls fsync() or fdatasync() for each committed
write transaction. This probably overshadows the performance benefits
of LMDB versus DB2, in exchange for improved durability. I think we
will want to always set the MDB_NOSYNC flag for the lockout
environment, and might need to add an option to set it for the primary
environment.
Agreed. Primary will be needed, even if only for testing.
I haven't added a nosync option to my prototype code yet, and the test
suite didn't seem painfully slow using LMDB. But I will likely add it
and use it for testing anyway.

Without adding another message to the thread, I will address Andrew
Bartlett's comment here:
Post by Andrew Bartlett
I just lurk here, but I have to agree with Simo here from Samba
experience. Be very careful about lock ordering between multiple
databases.
In this design, transactions on the lockout environment are all
ephemeral, consisting of at most one get and one put. There is no
iteration over it and no need to consult the primary environment during
a lockout transaction. So I don't think deadlock is a concern.

If we ever supply a tool to collect garbage entries in the lockout
database, that tool would hold open a read transaction to iterate over
the lockout DB and do gets (to test existence) on the primary
environment as it went. But since read transactions don't block other
transactions in LMDB, there is still no deadlock risk.
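
Roughly, such a tool might look like the sketch below: a read-only
cursor over the lockout database, a get against the primary environment
for each key, and a short write transaction per stale entry so nothing
else is blocked for long. (The names are hypothetical, and doing the
deletions in separate short write transactions is an assumption of mine
rather than a settled detail.)

    #include <lmdb.h>

    static int
    gc_lockout(MDB_env *primary_env, MDB_dbi princ_dbi,
               MDB_env *lockout_env, MDB_dbi lockout_dbi)
    {
        MDB_txn *rtxn = NULL, *ptxn = NULL, *wtxn;
        MDB_cursor *cur = NULL;
        MDB_val key, val, pval;
        int rc;

        rc = mdb_txn_begin(lockout_env, NULL, MDB_RDONLY, &rtxn);
        if (rc == 0)
            rc = mdb_txn_begin(primary_env, NULL, MDB_RDONLY, &ptxn);
        if (rc == 0)
            rc = mdb_cursor_open(rtxn, lockout_dbi, &cur);
        while (rc == 0 &&
               (rc = mdb_cursor_get(cur, &key, &val, MDB_NEXT)) == 0) {
            if (mdb_get(ptxn, princ_dbi, &key, &pval) != MDB_NOTFOUND)
                continue;            /* principal still exists (or error) */
            /* Stale entry: remove it in its own short write transaction. */
            rc = mdb_txn_begin(lockout_env, NULL, 0, &wtxn);
            if (rc != 0)
                break;
            rc = mdb_del(wtxn, lockout_dbi, &key, NULL);
            if (rc == 0 || rc == MDB_NOTFOUND)
                rc = mdb_txn_commit(wtxn);
            else
                mdb_txn_abort(wtxn);
        }
        if (rc == MDB_NOTFOUND)
            rc = 0;                  /* normal end of iteration */
        if (cur != NULL)
            mdb_cursor_close(cur);
        if (ptxn != NULL)
            mdb_txn_abort(ptxn);
        if (rtxn != NULL)
            mdb_txn_abort(rtxn);
        return rc;
    }
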
Simo Sorce
2018-04-12 12:03:27 UTC
Post by Greg Hudson
* We use two MDB environments, setting the MDB_NOSUBDIR flag so that
each environment is a pair of files instead of a subdirectory:
- A primary environment (suffix ".mdb") containing a "policy" database
holding policy entries and a "principal" database holding principal
entries minus lockout fields.
- A secondary environment (suffix ".lockout.mdb") containing a
"lockout" database holding principal lockout fields.
The KDC only needs to write to the lockout environment, and can open
the primary environment read-only.
The lockout environment is never emptied, never iterated over, and
uses only short-lived transactions, so the KDC is never blocked more
than briefly.
I am not a fan of setups that use multiple files for databases, especially when
transactions need to span multiple ones.
What is the underlying reason to do this in the new design instead of using a
single database file with all the data ?
--
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc

Greg Hudson
2018-04-12 14:59:15 UTC
Post by Simo Sorce
Post by Greg Hudson
The lockout environment is never emptied, never iterated over, and
uses only short-lived transactions, so the KDC is never blocked more
than briefly.
I am not a fan of setups that use multiple files for databases, especially when
transactions need to span multiple ones.
What is the underlying reason to do this in the new design instead of using a
single database file with all the data ?
Transactions are per-environment, so if we use one database file and a
write transaction for loads, loads would block the KDC. That's worse
than what we have with DB2.

Alternatively we could load into a temporary database file and rename it
into place like we do with DB2. But we would then have to close and
reopen the database between operations like we do with DB2, or somehow
signal processes that have the database open to reopen it after a load
completes.
Simo Sorce
2018-04-12 15:56:52 UTC
Post by Greg Hudson
Post by Simo Sorce
Post by Greg Hudson
The lockout environment is never emptied, never iterated over, and
uses only short-lived transactions, so the KDC is never blocked more
than briefly.
I am not a fan of setups that use multiple files for databases, especially when
transactions need to span multiple ones.
What is the underlying reason to do this in the new design instead of using a
single database file with all the data ?
Transactions are per-environment, so if we use one database file and a
write transaction for loads, loads would block the KDC. That's worse
than what we have with DB2.
Alternatively we could load into a temporary database file and rename it
into place like we do with DB2. But we would then have to close and
reopen the database between operations like we do with DB2, or somehow
signal processes that have the database open to reopen it after a load
completes.
How common are loads ?
As far as I know LMDB will let you keep reading during a transaction, so
the KDC would block only if there are write operations, but won't block
in general, right ?

The only write operations are on AS requests, when lockout is enabled
and when that triggers a change in the lockout fields. How common is that ?
Would that be something that can be mitigated by deferring those writes during
transactions ?

Simo.
--
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc

Greg Hudson
2018-04-12 16:34:08 UTC
Post by Simo Sorce
Post by Greg Hudson
Transactions are per-environment, so if we use one database file and a
write transaction for loads, loads would block the KDC. That's worse
than what we have with DB2.
Alternatively we could load into a temporary database file and rename it
into place like we do with DB2. But we would then have to close and
reopen the database between operations like we do with DB2, or somehow
signal processes that have the database open to reopen it after a load
completes.
How common are loads ?
That's hard to predict, but for a large database, having the KDC block
for the lifetime of a load operation seems like a pretty noticeable problem.
Post by Simo Sorce
As far as I know LMDB will let you keep reading during a transaction, so
the KDC would block only if there are write operations, but won't block
in general, right ?
Yes.
Post by Simo Sorce
The only write operations are on AS requests, when lockout is enabled
and when that triggers a change in the lockout fields. How common is that ?
By default, every successful AS request on a principal requiring preauth
updates the last_success timestamp. If disable_last_success is set (but
disable_lockout is not), only failed AS requests would cause a KDC write.
Post by Simo Sorce
Would that be something that can be mitigated by deferring those writes during
transactions ?
I don't see a way in LMDB to check for a write transaction, or begin a
write transaction without blocking. Queueing those writes to be
performed later would also add a lot of complexity.
Nathaniel McCallum
2018-04-15 20:51:57 UTC
Post by Greg Hudson
Post by Simo Sorce
Post by Greg Hudson
Transactions are per-environment, so if we use one database file and a
write transaction for loads, loads would block the KDC. That's worse
than what we have with DB2.
Could loads be segmented into chunks to avoid blocking the KDC for the
entire operation?
Post by Greg Hudson
Post by Simo Sorce
Post by Greg Hudson
Alternatively we could load into a temporary database file and rename it
into place like we do with DB2. But we would then have to close and
reopen the database between operations like we do with DB2, or somehow
signal processes that have the database open to reopen it after a load
completes.
How common are loads ?
That's hard to predict, but for a large database, having the KDC block
for the lifetime of a load operation seems like a pretty noticeable problem.
Post by Simo Sorce
As far as I know LMDB will let you keep reading during a transaction, so
the KDC would block only if there are write operations, but won't block
in general, right ?
Yes.
Post by Simo Sorce
The only write operations are on AS requests, when lockout is enabled
and when that triggers a change in the lockout fields. How common is that ?
By default, every successful AS request on a principal requiring preauth
updates the last_success timestamp. If disable_last_success is set (but
disable_lockout is not), only failed AS requests would cause a KDC write.
Could you create an opportunistic write queue for these ? This would
unblock the KDC during loads. The cost would be that the
aforementioned writes would not occur until the ends of the loads.
Slightly stale data is probably not a big deal for (at least)
last_success.
Post by Greg Hudson
Post by Simo Sorce
Would that be something that can be mitigated by deferring those writes during
transactions ?
I don't see a way in LMDB to check for a write transaction, or begin a
write transaction without blocking. Queueing those writes to be
performed later would also add a lot of complexity.
Greg Hudson
2018-04-15 21:43:28 UTC
Post by Nathaniel McCallum
Post by Greg Hudson
Transactions are per-environment, so if we use one database file and a
write transaction for loads, loads would block the KDC. That's worse
than what we have with DB2.
Could loads be segmented into chunks to avoid blocking the KDC for the
entire operation?
I don't believe so. If loads were purely additive, sure, but they also
delete the entries not present in the file.
Post by Nathaniel McCallum
Could you create an opportunistic write queue for these ? This would
unblock the KDC during loads. The cost would be that the
aforementioned writes would not occur until the ends of the loads.
Slightly stale data is probably not a big deal for (at least)
last_success.
I would like to see an objection more substantial than "I don't like
having multiple DB files" before thinking about adding something as
complicated as a KDC database write queue. We've been living with
multiple DB files in the DB2 back end (for principals and policies) for
the lifetime of the project.

Andrew Bartlett
2018-04-12 19:04:00 UTC
Post by Simo Sorce
I am not a fan of setups that use multiple files for databases, especially when
transactions need to span multiple ones.
What is the underlying reason to do this in the new design instead of using a
single database file with all the data ?
I just lurk here, but I have to agree with Simo here from Samba
experience. Be very careful about lock ordering between multiple
databases.

Andrew Bartlett
--
Andrew Bartlett http://samba.org/~abartlet/
Authentication Developer, Samba Team http://samba.org
Samba Developer, Catalyst IT http://catalyst.net.nz/services/samba
