Certbot Version & Invalid CA

(I am not a RabbitMQ expert. If a RabbitMQ expert you're comfortable with says I'm wrong about anything here, trust them or get a third opinion.)

Zeroth point:

Please don't wait until the last minute to renew. As you've discovered, it can take hours or days to figure out your best way forward, so it'd be nice to have, say, a week, or four, to figure this out before expiry. Consider adjusting your automation to be more typical: waking up once a day and renewing any certificates with only a month left on them is common practice. It's very little extra load for Let's Encrypt but much more peace of mind for you. Also consider adding monitoring that will alert you if important servers seem to have certificates which have, say, less than a week until they expire.
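To make the daily check concrete, here is a minimal Python sketch. The 30-day and 7-day thresholds are the "month" and "week" figures from this post; the function names and the idea of polling the server with Python's standard `ssl` module are my own assumptions, not something Certbot itself provides.

```python
import socket
import ssl
from datetime import datetime, timezone

RENEW_BELOW_DAYS = 30  # assumed threshold: "only a month left"
ALERT_BELOW_DAYS = 7   # assumed threshold: "less than a week"


def days_left(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2030 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def action_for(days: float) -> str:
    """Map remaining lifetime to what the automation should do."""
    if days < ALERT_BELOW_DAYS:
        return "alert"
    if days < RENEW_BELOW_DAYS:
        return "renew"
    return "ok"


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Read notAfter from the certificate a live server presents.

    The default context verifies the chain, so an already expired or
    untrusted certificate raises ssl.SSLCertVerificationError here.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job or similar could call `action_for(days_left(fetch_not_after(host), datetime.now(timezone.utc)))` once a day for each important host and page someone on "alert".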

First:

I suspect that

indicates that the cluster members (not the RabbitMQ server) do not trust this new certificate, or at least they don't trust whatever they're shown by the server. In the TLS protocol, when a client is shown a certificate it can send an "alert" (the TLS term for an error notification, which in the fatal case closes the connection) of type certificate_unknown if it decides it doesn't like that certificate.

It's possible this is because the certificate chain you're delivering is messed up somehow. (I think ssl_options.certfile should be the full chain from Certbot, probably tls.crt.)

But it's also possible those RabbitMQ clients just won't trust these new R3 certificates even if presented with a valid chain. In that case you may need to diagnose the problem from the point of view of those client systems. I assume that even if some are controlled by customers, you have test clients you can instrument and diagnose.
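One way to look at the handshake from the client side is a small probe like the Python sketch below. It is only an illustration: the function names are made up, port 5671 is assumed because it is the conventional AMQP-over-TLS port, and a real client library or runtime may use a different trust store than Python does, so a pass here does not prove every client is happy.

```python
import socket
import ssl


def build_context(cafile=None):
    """Client context that verifies the server chain and hostname.

    With cafile=None the platform's default trust store is used; pass
    a PEM bundle to mimic a client with a restricted trust store.
    """
    return ssl.create_default_context(cafile=cafile)


def probe(host, port=5671, cafile=None, timeout=10.0):
    """Attempt a TLS handshake and report why it failed, if it did."""
    ctx = build_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return f"ok: {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        # verify_message carries OpenSSL's reason, such as a missing issuer.
        return f"verify failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return f"connection failed: {exc}"
```

If `probe` reports a verification failure against the default trust store, the chain the server sends is the first suspect; if it passes but real clients still reject the certificate, the clients' own trust stores are the next thing to check.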

Second:

RabbitMQ is probably a bit different from many situations discussed in this forum because it is intended to provide Mutual Authentication. So where a web site is configured just to prove who it is to web browsers (and the browsers don't prove who they are at protocol level, we still have stuff like "passwords" for that) the RabbitMQ server is both configuring this and also configuring how to decide if the clients are who they say they are.

Mutual Authentication is very nice, but it is an extra bunch of stuff to get your head around compared to, say, HTTP or IMAP servers. I believe you have effectively disabled it here (verify = verify_none). That's fine, but I'll explain the consequences below.

The cacertfile is a trust store for CAs your server would trust to verify the connecting clients. If you were using verification, RabbitMQ would use cacertfile to decide which client certificates are trustworthy. But you aren't, so I believe any client which is able to connect to your server (this may be locked down by firewall rules or equivalent) can talk to RabbitMQ. Again, I'm not a RabbitMQ expert, so there may be other authentication layers that prevent this from causing a problem.

So typically you'd not set this to a public CA (like Let's Encrypt) because chances are you don't want "Everybody on the public Internet, but, not anybody else" (who is that even excluding?) to be able to connect. You'd create an in-house CA and issue certificates only to remote systems you want to connect. This makes most sense for larger systems, connected over the public Internet but not publicly accessible. For example maybe you'd issue them per-customer the way you might give out authentication tokens for an API today.
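The difference between the two modes can be sketched with Python's standard `ssl` module. This is an analogy, not RabbitMQ's implementation: RabbitMQ configures Erlang's TLS stack, and the function and parameter names below are hypothetical. The point is only that the server certificate and the client trust store are two separate settings.

```python
import ssl


def server_context(certfile=None, keyfile=None, client_cafile=None):
    """Build a server-side TLS context.

    certfile/keyfile: what the server presents to prove who it is
    (roughly the role of certfile and keyfile in the RabbitMQ config).
    client_cafile: CAs trusted to vouch for clients (roughly the role
    of cacertfile). Leaving it unset behaves like verify_none.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)
    if client_cafile:
        # Mutual authentication: the handshake fails unless the client
        # presents a certificate that chains to one of these CAs.
        ctx.verify_mode = ssl.CERT_REQUIRED
        ctx.load_verify_locations(cafile=client_cafile)
    else:
        # No client verification: anyone who can reach the port can
        # complete the handshake.
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

With an in-house CA you would pass its certificate as `client_cafile` and issue client certificates only to the systems you want to admit.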

But that's a whole extra adventure. Mostly my point here is that this is a little extra confusing because RabbitMQ is allowing you to configure two fairly complicated things in one place.


Hi @tialaramex,

Thanks for taking the time on this thread - you have some good insights.

Fwiw, we always begin the cert renewal process early; this was just a special case (for an older platform). Still, that's a good point zero for sure.

Some good thoughts on RMQ. We are still investigating why a slow roll of the pods was required in this case (and not in the past). Again, there are a few subtleties in running RMQ with TLS in Kubernetes as a StatefulSet with storage on persistent volumes. Normally, our renewal script (certbot + kubectl) simply outputs a single Kubernetes secret config that we can deploy (automated) to each cloud environment where our platform operates, and the ingress (load balancers) and RMQ pick up the change automatically and we're all set.

Fwiw, the LE change to the intermediate did not adversely affect our ingress in the cloud, it was only RMQ that had the issue. There appears to be some variance across RMQ versions with respect to TLS options (not surprising) - we are running an RMQ v3.7.15 image for this older platform.

Just to confirm again for any future reader, rolling the cluster (restarting each RMQ node) did indeed resolve the issue of:

TLS server: In state certify received CLIENT ALERT: Fatal - Certificate Unknown

Regarding (verify = verify_none), yes, well said and understood. Due to the nature of our platform and the way our tenants consume data from the back of the pipeline, this is a requirement for us.

If we discover anything of further value, we'll follow up to share with the community.

Once again, everyone's input has been great - thank you.

