GSS context refresh failure due to clock skew

Sorry for the delay; Andy's mail got stuck in the krbdev moderation
queue by mistake.

The situation occurs as follows.

I am a little bit confused by this description because of terminology
issues. In your description, you appear to use the phrase "TGS" to
refer to service tickets (i.e. tickets whose service principal is
nfs/server.name), but I can't be sure. The actual meaning of "TGS" is
"ticket-granting service," i.e. the KDC service whose principal name is
krbtgt/REALM.

2) For convenience, I set the TGS lifetimes to be as short as possible, 10 minutes for Win2008R2 AD which I test with.

Are you setting the maximum lifetime for nfs/server.name tickets to 10
minutes, but still allowing ticket-granting tickets to have a lifetime
of multiple hours?

12) Wait until the client clock is past the server TGS expiry time
13) re-try the mkdir - it succeeds after a successful GSS INIT NULL call exchange for both servers.

If I understand correctly, this request succeeds because
krb5_get_credentials() ignores the expired cached service ticket and
makes a TGS request for a new service ticket. The cache now contains:

* A ticket for krbtgt/REALM with hours remaining
* A ticket for nfs/server.name which expired recently
* Another ticket for nfs/server.name which expires in ten minutes

Is that correct?

Shouldn’t these refresh calls succeed? Isn’t the Kerberos clock skew supposed to handle this situation?

I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes. If the TGT itself has expired or is about to expire, some
out-of-band agent needs to refresh the TGT somehow, and it doesn't
matter all that much whether the failure comes from the client or the
server.

That said, your scenario should work, and it doesn't. The primary cause
is an explicit check added to the krb5 mech's gss_accept_sec_context()
implementation in 1996 (before the MIT krb5 1.0 release), which checks
the ticket endtime with no allowance for clock skew. I don't know
precisely why the check was added, but my guess is that it is for the
computation of the context validity lifetime; it would make no sense to
tell the application "the authentication succeeded and the resulting
context is valid for the next -3 minutes."

Perhaps a better choice would be to remove this check, and instead add
the clock skew to the validity lifetime of GSS krb5 acceptor contexts.
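
The difference between the existing check and a skew-tolerant one can be sketched as follows. This is only a model of the behavior described above, not the actual krb5 source; the function names, the timestamps, and the 5-minute allowance are assumptions made for illustration.

```python
def strict_accept(ticket_endtime: int, server_now: int) -> bool:
    """Model of the strict check: reject once the server clock passes endtime."""
    return server_now <= ticket_endtime


def skew_tolerant_accept(ticket_endtime: int, server_now: int,
                         clockskew: int = 300) -> bool:
    """Same check, but allowing `clockskew` seconds of clock disagreement."""
    return server_now <= ticket_endtime + clockskew


endtime = 1_000_000          # ticket end time in seconds (made-up value)
server_now = endtime + 120   # server clock reads 2 minutes past the end time

strict_result = strict_accept(endtime, server_now)
tolerant_result = skew_tolerant_accept(endtime, server_now)
print(strict_result, tolerant_result)
```

With these inputs the strict model rejects the ticket while the tolerant model accepts it; once the server clock is more than the allowance past the end time, both reject.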
_______________________________________________
krbdev mailing list ***@mit.edu
https://mailman.mit.ed

Adamson, Andy

2015-10-05 19:35:39 UTC

Post by Greg Hudson
Sorry for the delay; Andy's mail got stuck in the krbdev moderation
queue by mistake.

The situation occurs as follows.

Hi Greg

Pardon my terminology gaffe. I mean a ticket for nfs/server.name.

2) For convenience, I set the TGS lifetimes to be as short as possible, 10 minutes for Win2008R2 AD which I test with.

Are you setting the maximum lifetime for nfs/server.name tickets to 10
minutes, but still allowing ticket-granting tickets to have a lifetime
of multiple hours?

[***@rhel6-7ga sles-kernel]# klist -ce /tmp/krb5cc_machine_ANDROSAD.FAKE
Ticket cache: FILE:/tmp/krb5cc_machine_ANDROSAD.FAKE
Default principal: nfs/rhel6-***@ANDROSAD.FAKE

Valid starting Expires Service principal
09/30/15 11:57:02 09/30/15 12:57:02 krbtgt/***@ANDROSAD.FAKE
renew until 10/07/15 11:57:02, Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96
09/30/15 11:57:02 09/30/15 12:07:02 nfs/rhel7-1ga-***@ANDROSAD.FAKE
renew until 10/07/15 11:57:02, Etype (skey, tkt): arcfour-hmac, arcfour-hmac

12) Wait until the client clock is past the server TGS expiry time
13) re-try the mkdir - it succeeds after a successful GSS INIT NULL call exchange for both servers.

If I understand correctly, this request succeeds because
krb5_get_credentials() ignores the expired cached service ticket and
makes a TGS request for a new service ticket. The cache now contains:
* A ticket for krbtgt/REALM with hours remaining
* A ticket for nfs/server.name which expired recently
* Another ticket for nfs/server.name which expires in ten minutes
Is that correct?

Yes, and the new service ticket produces an RPCSEC_GSS_INIT token whose expiry passes the server's clock test.

Shouldn’t these refresh calls succeed? Isn’t the Kerberos clock skew supposed to handle this situation?

I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.

We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

Post by Greg Hudson
If the TGT itself has expired or is about to expire, some
out-of-band agent needs to refresh the TGT somehow, and it doesn't
matter all that much whether the failure comes from the client or the
server.

I thought that having a keytab entry and a renewable TGT was enough.

Post by Greg Hudson
That said, your scenario should work, and it doesn't. The primary cause
is an explicit check added to the krb5 mech's gss_accept_sec_context()
implementation in 1996 (before the MIT krb5 1.0 release), which checks
the ticket endtime with no allowance for clock skew. I don't know
precisely why the check was added, but my guess is that it is for the
computation of the context validity lifetime; it would make no sense to
tell the application "the authentication succeeded and the resulting
context is valid for the next -3 minutes."

That also makes no sense - simply use the Kerberos clock skew in the message. For example, if the clock skew is 5 minutes and, according to the server clock, the ticket has been expired for 2 minutes, then the message becomes "the authentication succeeded and the resulting context is valid for the next 3 minutes," since 3 minutes remain on the server clock once the configured Kerberos clock skew is taken into account.
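
The arithmetic in that example can be written out as a small sketch. The function name and the timestamps are hypothetical; only the 5-minute skew and the 2 minutes past expiry come from the example above.

```python
def acceptor_context_lifetime(ticket_endtime: int, server_now: int,
                              clockskew: int) -> int:
    """Seconds of validity to report, treating the ticket as usable
    until ticket_endtime + clockskew on the server's clock."""
    return max(0, ticket_endtime + clockskew - server_now)


clockskew = 5 * 60               # configured clock skew: 5 minutes
endtime = 1_000_000              # ticket end time in seconds (made-up value)
server_now = endtime + 2 * 60    # ticket expired 2 minutes ago per the server

lifetime = acceptor_context_lifetime(endtime, server_now, clockskew)
print(lifetime // 60, "minutes")  # prints "3 minutes"
```

Clamping at zero avoids ever reporting a negative lifetime to the application.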

Post by Greg Hudson
Perhaps a better choice would be to remove this check, and instead add
the clock skew to the validity lifetime of GSS krb5 acceptor contexts.

Yes. That is my opinion.

—>Andy


Greg Hudson

2015-10-05 20:02:07 UTC

Post by Greg Hudson
I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

If the issue is not caused by short-lifetime service principals, then
the test scenario you described isn't representative of the real
scenario. To reproduce the problem as it manifests in your IO tests,
you will need to adjust the TGT lifetime down to ten minutes as well as
the nfs/server lifetime.

I thought that having a keytab entry and a renewable TGT was enough.

I'm not sure why you would do both of these; if you're getting initial
creds with a keytab, there is no need to muck around with ticket renewal.

Anyway, gss_init_sec_context() never renews tickets, and only gets
tickets from a keytab when a client keytab is configured (new in 1.11).
When tickets are obtained using a client keytab, they are refreshed
from the keytab when they are halfway to expiring, so this clock skew
issue should not arise, so I don't think that feature is being used.
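
As a rough illustration of the halfway rule described above (a model only, not the library's code; the function name and the times are made up):

```python
def refresh_due(starttime: int, endtime: int, now: int) -> bool:
    """True once the ticket is at least halfway through its lifetime."""
    halfway = starttime + (endtime - starttime) // 2
    return now >= halfway


# A 10-minute ticket issued at t=0 is replaced from t=300 onward,
# i.e. while it still has 5 minutes left on the client's clock.
start, end = 0, 600
before_halfway = refresh_due(start, end, 299)
at_halfway = refresh_due(start, end, 300)
print(before_halfway, at_halfway)
```

Under this model a ticket is never presented in the second half of its lifetime, so a server clock that is ahead by less than half the ticket lifetime would not see it as expired.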

It is possible that the NFS client code has its own separate logic for
obtaining new tickets using a keytab. If so, we need to understand how
it works. It's possible (though unlikely) that changing the behavior of
gss_accept_sec_context() wouldn't be sufficient by itself.

Adamson, Andy

2015-10-05 20:34:42 UTC

Post by Greg Hudson
I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

If the issue is not caused by short-lifetime service principals,

I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.

I didn’t know that keeping service ticket lifetimes no shorter than TGT lifetimes was a requirement. Neither does NetApp QA and, I suspect, neither do customers in general.

Post by Greg Hudson
then
the test scenario you described isn't representative of the real
scenario. To reproduce the problem as it manifests in your IO tests,
you will need to adjust the TGT lifetime down to ten minutes as well as
the nfs/server lifetime.

Code was added to rpc.gssd, the NFS client agent that creates GSS contexts for NFS, to take into account the clock skew and get a new TGT before (now+clock skew). So if the service ticket lifetime is equal to or greater than the TGT lifetime, then all is well.
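
One reading of that check, sketched under assumptions: the function name, the default of 300 seconds, and the exact comparison are illustrative and are not taken from the rpc.gssd source.

```python
def tgt_needs_refresh(tgt_endtime: int, client_now: int,
                      clockskew: int = 300) -> bool:
    """Refresh the TGT if it would expire within `clockskew` seconds
    of the client's current time."""
    return tgt_endtime <= client_now + clockskew


tgt_end = 3600                                # made-up TGT end time, seconds
early = tgt_needs_refresh(tgt_end, 3000)      # 10 minutes left
late = tgt_needs_refresh(tgt_end, 3400)       # under 5 minutes left
print(early, late)
```

Note that this only looks at the TGT end time. A service ticket that ends earlier than the TGT is not covered by it, which is the case under discussion.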

I thought that having a keytab entry and a renewable TGT was enough.

I'm not sure why you would do both of these; if you're getting initial
creds with a keytab, there is no need to muck around with ticket renewal.

I wouldn’t, but QA and customers do.

Post by Greg Hudson
Anyway, gss_init_sec_context() never renews tickets, and only gets
tickets from a keytab when a client keytab is configured (new in 1.11).
When tickets are obtained using a client keytab, they are refreshed
from the keytab when they are halfway to expiring,

refreshed by…?

Post by Greg Hudson
so this clock skew
issue should not arise, so I don't think that feature is being used.
It is possible that the NFS client code has its own separate logic for
obtaining new tickets using a keytab.

When an NFS request requires a GSS context, the client kernel does an upcall to rpc.gssd in any of three cases: the context does not exist; it is not valid; or it is valid, but the server answers an RPC request that used it with an RPC error indicating that its side of the GSS context has a problem. rpc.gssd then decides whether a new service ticket is required in order to send an RPCSEC_GSS_INIT message to the server and create a new GSS context. The resultant GSS context is stored in the client kernel with a lifetime equal to that of the service ticket used to create it.

If rpc.gssd calls the code that refreshes the tickets from the keytab when they are halfway to expiring, then that should mitigate the clock skew issue.
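
The upcall conditions above can be summarized in a short sketch. The class, field, and function names are hypothetical and do not correspond to kernel or rpc.gssd identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GssContext:
    expiry: int  # client-clock time at which the context stops being valid


def needs_upcall(ctx: Optional[GssContext], client_now: int,
                 server_reported_ctx_error: bool) -> bool:
    """True if the client should ask rpc.gssd for a new GSS context."""
    if ctx is None:                  # no context exists yet
        return True
    if client_now >= ctx.expiry:     # context no longer valid locally
        return True
    # Context looks valid locally, but the server may have rejected it.
    return server_reported_ctx_error


ctx = GssContext(expiry=600)
results = (
    needs_upcall(None, 100, False),
    needs_upcall(ctx, 700, False),
    needs_upcall(ctx, 500, True),
    needs_upcall(ctx, 500, False),
)
print(results)
```

The third case is the one that matters for clock skew: the context is still valid by the client's clock, and only the server's error reply triggers the upcall.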

Post by Greg Hudson
If so, we need to understand how
it works. It's possible (though unlikely) that changing the behavior of
gss_accept_sec_context() wouldn't be sufficient by itself.


Benjamin Kaduk

2015-10-06 00:02:11 UTC

Post by Greg Hudson
I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

If the issue is not caused by short-lifetime service principals,

I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.
I didn’t know setting the service ticket lifetimes to not be less than
TGT lifetimes was a requirement. Neither does NetApp QA and I suspect,
neither do customers in general.

It's not a requirement. (Greg explicitly said "That said, your scenario
should work, and it doesn't." in his first message.)

Code was added to rpc.gssd, the NFS client agent that creates GSS
contexts for NFS, to take into account the clock skew and get a new TGT
before (now+clock skew). So if the service ticket lifetime is equal to
or greater than the TGT lifetime, then all is well.

I thought that having a keytab entry and a renewable TGT was enough.

I'm not sure why you would do both of these; if you're getting initial
creds with a keytab, there is no need to muck around with ticket renewal.

I wouldn’t, but QA and customers do.

refreshed by…?

The GSS library itself.
http://k5wiki.kerberos.org/wiki/Projects/Keytab_initiation and
http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html#default-client-keytab
give a little bit of intro, though this feature could benefit from better
documentation.

-Ben

When an NFS request requires a GSS context, if the context does not
exist, is not valid, or if it is valid but the server replies to an RPC
request using a GSS context with an RPC error that indicates its side
of the GSS context has a problem, the client kernel does an upcall to
rpc.gssd which then decides if a new service ticket is required to send
an RPCSEC_GSS_INIT message to the server to create a new GSS context.
The resultant GSS context is stored in the client kernel with a lifetime
equal to the service ticket used to create it.
If rpc.gssd calls the code that refreshes the tickets from the keytab
when they are halfway to expiring, then that should mitigate the clock
skew issue.

Post by Greg Hudson
If so, we need to understand how
it works. It's possible (though unlikely) that changing the behavior of
gss_accept_sec_context() wouldn't be sufficient by itself.


Adamson, Andy

2015-10-06 14:53:16 UTC

Post by Benjamin Kaduk

Post by Greg Hudson
I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
We see this issue all the time in NetApp QA as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

If the issue is not caused by short-lifetime service principals,

I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.
I didn’t know setting the service ticket lifetimes to not be less than
TGT lifetimes was a requirement. Neither does NetApp QA and I suspect,
neither do customers in general.

It's not a requirement. (Greg explicitly said "That said, your scenario
should work, and it doesn't." in his first message.)

Hi Ben

OK. This does mean that until this gets addressed, we will need to point this out to administrators.

Post by Benjamin Kaduk

Code was added to rpc.gssd, the NFS client agent that creates GSS
contexts for NFS, to take into account the clock skew and get a new TGT
before (now+clock skew). So if the service ticket lifetime is equal to
or greater than the TGT lifetime, then all is well.

I thought that having a keytab entry and a renewable TGT was enough.

I'm not sure why you would do both of these; if you're getting initial
creds with a keytab, there is no need to muck around with ticket renewal.

I wouldn’t, but QA and customers do.

refreshed by…?

Thanks for the info

—>Andy

Post by Benjamin Kaduk
-Ben

When an NFS request requires a GSS context, if the context does not
exist, is not valid, or if it is valid but the server replies to an RPC
request using a GSS context with an RPC error that indicates its side
of the GSS context has a problem, the client kernel does an upcall to
rpc.gssd which then decides if a new service ticket is required to send
an RPCSEC_GSS_INIT message to the server to create a new GSS context.
The resultant GSS context is stored in the client kernel with a lifetime
equal to the service ticket used to create it.
If rpc.gssd calls the code that refreshes the tickets from the keytab
when they are halfway to expiring, then that should mitigate the clock
skew issue.

Post by Greg Hudson
If so, we need to understand how
it works. It's possible (though unlikely) that changing the behavior of
gss_accept_sec_context() wouldn't be sufficient by itself.


Greg Hudson

2015-10-07 14:45:08 UTC

Actually, setting the service ticket lifetime to be equal to (or greater than if this is possible) the TGT lifetime will not help. Just as in the example I sent, the application will get permission denied during the time difference between the client and server clock.

That is expected. What is not expected, in this variant, is that
gss_init_sec_context() will succeed by itself once the client believes
the TGT and service ticket to have expired. Apologies for any
miscommunication on this point.

There may be something in the calling code which refreshes the TGT in
this situation. If so, then to fully understand the scenario, we need
to know how the calling code decides when to refresh the TGT.

I opened a ticket about this issue here:

http://krbdev.mit.edu/rt/Ticket/Display.html?id=8268

Greg Hudson

2015-10-07 15:08:38 UTC

—— from the ticket: ——
This unnecessarily strict check causes a particularly bad experience
when (a) the client's clock is slightly ahead of the server's clock,
and (b) the maximum service ticket lifetime is lower than the maximum
TGT lifetime.
—— ——
I think both a) and b) are incorrect.
For a), you got it backwards: this occurs when the server clock is ahead of the client clock.

Yes, I did write the wrong thing there; I will follow up on that.

For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, as the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.

I will try one more time to communicate what I mean:

* If the service ticket end time is less than the TGT end time, then
gss_init_sec_context() fails during the clock skew window, and starts
succeeding again afterwards.

* If the service ticket and TGT have both expired (according to the
server), then gss_init_sec_context() fails, and keeps failing
afterwards, unless there is some out-of-band agent refreshing expired TGTs.

Put another way: we expect authentications to start failing around the
time the TGT expires. We do not expect authentications to start failing
around the time a service ticket expires, if the TGT is still valid.
That is what I refer to as a "particularly" bad experience.
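
The two cases can be illustrated with a toy timeline. This is a simplified model, assuming a strict acceptor with no skew allowance, a client that reuses its cached service ticket until its own clock passes the end time, and no out-of-band TGT refresh; all names and numbers are made up.

```python
def auth_outcome(client_now: int, offset: int, svc_end: int, tgt_end: int,
                 svc_lifetime: int = 600) -> str:
    """offset is server clock minus client clock (positive: server ahead)."""
    server_now = client_now + offset
    if client_now < svc_end:
        # Client believes the cached service ticket is valid and reuses it.
        return "ok" if server_now <= svc_end else "rejected by server"
    if client_now < tgt_end:
        # Client sees the expiry and gets a new service ticket using the TGT.
        new_end = min(client_now + svc_lifetime, tgt_end)
        if server_now <= new_end:
            return "ok with new service ticket"
        return "rejected by server"
    return "fails until the TGT is refreshed"


# Server 2 minutes ahead; service ticket ends at 600, TGT ends at 3600.
for t in (400, 500, 600, 3600):
    print(t, auth_outcome(t, offset=120, svc_end=600, tgt_end=3600))
```

In this model the rejections are confined to the window between 480 and 600 on the client's clock, after which authentication succeeds again. If the TGT ends at the same time as the service ticket, the failures simply continue.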

If that isn't clear, perhaps we should ignore this as a moot point; it
doesn't really affect how we plan to change the krb5 code.

Olga Kornievskaia

2015-10-07 15:46:01 UTC

Yes, I did write the wrong thing there; I will follow up on that.

* If the service ticket end time is less than the TGT end time, then
gss_init_sec_context() fails during the clock skew window, and starts
succeeding again afterwards.
* If the service ticket and TGT have both expired (according to the
server), then gss_init_sec_context() fails, and keeps failing
afterwards, unless there is some out-of-band agent refreshing expired TGTs.
Put another way: we expect authentications to start failing around the
time the TGT expires. We do not expect authentications to start failing
around the time a service ticket expires, if the TGT is still valid.

Why not? This is not what should happen according to the theory of the
Kerberos protocol. To use slightly generic terms: a TGT is a
credential that proves the client's identity to the KDC. The TGT and its
lifetime have no relevance to authentication between the
client and a kerberized service, in this case an NFS server. A
service ticket is a credential that is used to prove the client's identity
to the NFS server. The NFS service ticket should be accepted as valid
for its lifetime plus some configurable clock skew.

Post by Greg Hudson
That is what I refer to as a "particularly" bad experience.
If that isn't clear, perhaps we should ignore this as a moot point; it
doesn't really affect how we plan to change the krb5 code.

Simo Sorce

2015-10-08 01:48:52 UTC

Yes, I did write the wrong thing there; I will follow up on that.

Why not?

Because technically the client can acquire a new ticket at any time if
the TGT is valid, but in this case instead it fails to acquire a new
ticket and fails the authentication.

Post by Olga Kornievskaia
This is not what should happen according to the theory of
Kerberos protocol. Let's use slightly generic terms, TGT is a
credential that proves client's identity to the KDC. TGT or its
lifetime has no relevance in the context of authentication between the
client and a kerberized service, in this case an NFS server. Then a
service ticket is a credential that is used to prove client's identity
to the NFS server. The lifetime of the NFS service ticket should be
allowed to be valid within some configurable clock skew.

Yes, but this is not what Greg was referring to :)

Simo.


--
Simo Sorce * Red Hat, Inc * New York

Olga Kornievskaia

2015-10-09 14:37:48 UTC

Post by Simo Sorce

Yes, I did write the wrong thing there; I will follow up on that.

Why not?

Because technically the client can acquire a new ticket at any time if
the TGT is valid, but in this case instead it fails to acquire a new
ticket and fails the authentication.

The client should not be acquiring a new service ticket when it has a
service ticket that is non-expired according to its own clock. The
problem is that the server thinks the ticket has expired because it
allows no slack for skewed clocks, and that's incorrect.

It's not clear that this issue is agreed upon. Whether or not a new
service ticket is acquired later by the client is not in question.

If the server implements a reasonable clock skew policy, it will allow
for the client side code to detect that the service ticket has expired
and renew it. That functionality is properly working on the client
side.

Alternatively, the client-side code could be changed to handle a
CREDENTIALS_EXPIRED error properly by acquiring a new service ticket
at that point, which the code doesn't currently do.
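
A minimal sketch of that alternative, with every name hypothetical: the exception class stands in for the server's CREDENTIALS_EXPIRED reply, and the three callables stand in for whatever the client uses to send the init token, read the cached ticket, and obtain a fresh one.

```python
class CredentialsExpired(Exception):
    """Stand-in for the server reporting that the ticket has expired."""


def establish_context(send_init, get_cached_ticket, fetch_fresh_ticket):
    """Try the cached ticket first; on an expiry error, retry once
    with a newly acquired service ticket."""
    try:
        return send_init(get_cached_ticket())
    except CredentialsExpired:
        # The server's clock says the cached ticket is expired, even
        # though the client's clock does not. Do not reuse it.
        return send_init(fetch_fresh_ticket())


# Demo with fake callables: this fake server rejects any ticket whose
# end time is at or before 700 on its own clock.
def fake_send_init(ticket_endtime):
    if ticket_endtime <= 700:
        raise CredentialsExpired()
    return f"context(endtime={ticket_endtime})"


result = establish_context(fake_send_init, lambda: 600, lambda: 1300)
print(result)  # prints "context(endtime=1300)"
```

The retry happens only once, so a server that also rejects the fresh ticket still surfaces the error to the caller.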

Post by Simo Sorce

Yes, but this is not what Greg was referring to :)
Simo.