Profile selection

Anyone else having problems with the new profile selection feature?

I get the following error (staging and production):

Error creating new order :: unrecognized profile name "classic" (urn:ietf:params:acme:error:invalidProfile)

I'm seeing the same. Staging directory endpoint currently advertises both classic and tlsserver, but accepts neither. Prod advertises only classic but doesn't accept it either.

Not sure what's going on.

Boulder on production currently is running eda49660, which is fairly recent. A few commits back we can see some profile related commits.

Maybe some bug was introduced.

Thanks for the confirmation!

There were some changes around handling profiles, so it is definitely possible something has broken. It’s the weekend so I think it likely won’t be resolved til Monday.

But I assume the working of profiles is tested in some integration test?

As far as I can tell, the problem is not in the code but in the config (which we don't see).

When using the "classic" profile for example:

The call to

wfe.validateCertificateProfileName(newOrderRequest.Profile)

succeeds.

In the next step

ra.profiles.get(req.CertificateProfileName)

fails with

Error creating new order :: unrecognized profile name "classic"

So it seems the Web Frontend (wfe2/wfe.go) has the profiles configured but the Registration Authority (ra/ra.go) has not.
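A minimal sketch of that hypothesis, assuming two components that each consult only their own configuration. The types and function names here (wfeConfig, raConfig, wfeValidate, raLookup, newOrder) are hypothetical stand-ins, not Boulder's real code:

```go
package main

import (
	"errors"
	"fmt"
)

// wfeConfig and raConfig are hypothetical stand-ins for two separately
// loaded configurations. They are not Boulder's real types.
type wfeConfig struct {
	CertProfiles map[string]string // profile name -> description
}

type raConfig struct {
	ValidationProfiles map[string]struct{} // set of known profile names
}

var errUnknownProfile = errors.New("unrecognized profile name")

// wfeValidate only consults the front end's own config.
func wfeValidate(c wfeConfig, name string) error {
	if _, ok := c.CertProfiles[name]; !ok {
		return fmt.Errorf("%w %q", errUnknownProfile, name)
	}
	return nil
}

// raLookup only consults the RA's own config.
func raLookup(c raConfig, name string) error {
	if _, ok := c.ValidationProfiles[name]; !ok {
		return fmt.Errorf("%w %q", errUnknownProfile, name)
	}
	return nil
}

// newOrder runs both checks in sequence, the way the request flows
// from the front end to the RA in the description above.
func newOrder(w wfeConfig, r raConfig, name string) error {
	if err := wfeValidate(w, name); err != nil {
		return fmt.Errorf("wfe: %w", err)
	}
	if err := raLookup(r, name); err != nil {
		return fmt.Errorf("ra: %w", err)
	}
	return nil
}

func main() {
	w := wfeConfig{CertProfiles: map[string]string{"classic": "default profile"}}
	r := raConfig{} // no profiles configured on the RA side
	fmt.Println(newOrder(w, r, "classic"))
}
```

With these inputs the front-end check passes and the RA lookup fails, which matches the symptom: the profile is advertised but the order is rejected.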

If the config somehow has disabled profiles or not configured the classic profile, why is it announced in the directory? Sure also sounds like a bug in the code to me :slight_smile: Unless perhaps there are two separate places in the configuration where profiles are "enabled", but that sounds kinda strange to me.

I would also assume integration tests are using (mostly) the same configuration as staging and/or production.

There is more than one config.

The Web Frontend uses c.WFE.CertProfiles, which is read from "wfe2.json".

The Registration Authority uses c.RA.ValidationProfiles, which is read from "ra.json".

This is all in the "config-next" folder. However, in the "config" folder itself, the "validationProfiles" key is missing from ra.json.

The directory works since it belongs to the Web Frontend.

And there isn't a check in place that makes sure certain items of both JSONs are congruent? :thinking: Guess not.

I would think Boulder would refuse to stop/reload/restart/start if the configurations aren't compatible with each other.

There could be times when the two would diverge, e.g. when a profile is retired it would be removed from the WFE, but it would have to stay on the RA side until old orders with that profile expire.

Have left a suggestion on the boulder repo for additional automated production testing to cover this: "Implement Testing in Production to cover new feature sets" (Issue #8001, letsencrypt/boulder on GitHub).

Hi folks, we've tracked down the bug, and indeed it was a deployability bug -- some of the code deployed in this past week's code deploy inadvertently expected certain configuration to be in place, when that config was not yet in place.

For the curious, the bug is right here:

func (vp *validationProfiles) get(name string) (*validationProfile, error) {
	if name == "" {
		name = vp.defaultName
	}
	profile, ok := vp.byName[name]
	if !ok {
		return nil, berrors.InvalidProfileError("unrecognized profile name %q", name)
	}
	return profile, nil
}

This function should have had an additional check at the top:

	if vp.defaultName == UnconfiguredDefaultProfileName {
		return vp.byName[vp.defaultName], nil
	}

to ensure that all requested profiles are accepted and given the default values, as long as the RA's configuration hasn't yet been updated with specific per-profile settings.
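For anyone who wants to try it, here is a self-contained sketch of get with the extra check in place. It is a sketch under assumptions, not the actual patch: the types are trimmed stand-ins, the label field and legacyProfiles helper are hypothetical, the constant's value is a placeholder, and fmt.Errorf replaces berrors.InvalidProfileError so it runs without Boulder imports.

```go
package main

import "fmt"

// Placeholder value: the real constant's value is not shown in this thread.
const UnconfiguredDefaultProfileName = "unconfigured-default"

// validationProfile is a trimmed stand-in; label is a hypothetical field.
type validationProfile struct {
	label string
}

type validationProfiles struct {
	defaultName string
	byName      map[string]*validationProfile
}

func (vp *validationProfiles) get(name string) (*validationProfile, error) {
	// The additional check: while the RA has no per-profile settings yet,
	// every requested profile name gets the default values.
	if vp.defaultName == UnconfiguredDefaultProfileName {
		return vp.byName[vp.defaultName], nil
	}
	if name == "" {
		name = vp.defaultName
	}
	profile, ok := vp.byName[name]
	if !ok {
		// Stand-in for berrors.InvalidProfileError.
		return nil, fmt.Errorf("unrecognized profile name %q", name)
	}
	return profile, nil
}

// legacyProfiles models an RA whose config has no per-profile settings yet.
func legacyProfiles() *validationProfiles {
	return &validationProfiles{
		defaultName: UnconfiguredDefaultProfileName,
		byName: map[string]*validationProfile{
			UnconfiguredDefaultProfileName: {label: "defaults"},
		},
	}
}

func main() {
	p, err := legacyProfiles().get("classic")
	fmt.Println(p.label, err)
}
```

Without the first check, the same call falls through to the map lookup and returns the "unrecognized profile name" error from the original post.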

I've put together a fix. Thanks for your patience, everybody!

Good analysis! Note that the configs found in the config and config-next folders are not the configs that our ACME Server is deployed with: they're test configs for Boulder's integration tests. So while they are intended to be similar to our prod (config) and staging (config-next) configuration, one can't generally use them as an indication of exactly what the current production configs look like.

If boulder services refused to start unless their configs were congruent, we'd never be able to do rolling-restarts where individual components (and even individual replicas of each component) are brought up with new config one-by-one so we don't have any downtime during a deploy.

So instead we defend against these with the config-loading code. In this case, you can see that the RA was made happy to start with either the new-style or old-style configs, because we know that we can't deploy the new-style configs at the exact same instant as the new code. I just happened to write a bug a little bit further down the line, and didn't catch it in unit or integration tests.
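As a rough illustration of that kind of tolerant config loading (not Boulder's actual loader), the sketch below accepts either shape of JSON. Only the "validationProfiles" key comes from this thread; the other field names, the unconfiguredDefault value, and loadProfiles itself are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Placeholder name for the single synthesized profile.
const unconfiguredDefault = "unconfigured-default"

// profileSettings holds per-profile values; the field is hypothetical.
type profileSettings struct {
	OrderLifetimeHours int `json:"orderLifetimeHours"`
}

// raSettings accepts both shapes: an old-style top-level setting, or
// new-style per-profile settings plus a default profile name.
type raSettings struct {
	OrderLifetimeHours int                        `json:"orderLifetimeHours"`
	DefaultProfileName string                     `json:"defaultProfileName"`
	ValidationProfiles map[string]profileSettings `json:"validationProfiles"`
}

// loadProfiles returns the default profile name and the profile table.
// An old-style config yields one synthesized profile instead of an error,
// so new code can start before the new config is deployed.
func loadProfiles(raw []byte) (string, map[string]profileSettings, error) {
	var c raSettings
	if err := json.Unmarshal(raw, &c); err != nil {
		return "", nil, err
	}
	if len(c.ValidationProfiles) == 0 {
		return unconfiguredDefault, map[string]profileSettings{
			unconfiguredDefault: {OrderLifetimeHours: c.OrderLifetimeHours},
		}, nil
	}
	if _, ok := c.ValidationProfiles[c.DefaultProfileName]; !ok {
		return "", nil, fmt.Errorf("default profile %q is not configured", c.DefaultProfileName)
	}
	return c.DefaultProfileName, c.ValidationProfiles, nil
}

func main() {
	name, profiles, err := loadProfiles([]byte(`{"orderLifetimeHours": 168}`))
	fmt.Println(name, len(profiles), err)
}
```

The loader is only half of the defense: every later lookup also has to handle the synthesized default, which is the step that was missed here.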

Hm, makes sense. I shouldn't compare Boulder with stuff like Apache and nginx, which are (as far as I know in my little home server environments) single component services.

What should have happened here is that our automated issuance testing should have caught that it couldn’t issue certificates, but we don’t have it using profiles yet. That’s on the TODO list before we are ready to launch multiple profiles in production.
