Error when generating a certificate with Traefik v2 ACME TLS challenge - Docker Swarm

Hi,

I am trying to get Traefik v2 working on Docker Swarm with the TLS-ALPN challenge in order to get certificates from Let’s Encrypt.

I have already tested about 20 different configurations without managing to get certificates through the ACME TLS challenge, and I don’t understand why. I don’t think the problem is my Traefik config but rather the network configuration, because I’m not sure that Let’s Encrypt manages to connect to https://fqdn:443/ to get the information from the default certificate.

I have already tested the httpChallenge too, but it also fails with an error. I want to understand my errors with both the TLS and HTTP challenges, so I think I will create a separate post for the HTTP challenge error.

OK, let’s get started:

I have a Swarm cluster of three nodes with one Traefik instance on each node, and an OVH load balancer in front.

First of all: the entry point into my network is the OVH load balancer.

Front-end overview

Name: lb-frontend-443
Protocol: tcp
Port: 443

Name: lb-frontend-80
Protocol: http
Port: 80

Secondly: the front end forwards requests to server farms that contain the three Docker nodes.

Name: farm-443
Protocol: tcp
Port: 443
Datacentre:
Distribution mode: Round-robin
Track session: Source IP
Probe: TCP, port 443

Name: farm-80
Protocol: http
Port: 80
Datacentre:
Distribution mode: Source
Track session: Source IP
Probe: TCP, port 80

Then I deployed Traefik on these servers, with ports 80 and 443 bound to the host.

I can connect to the Traefik dashboard.

This is my docker-compose:

version: '3.7'

networks:
  traefik-public:
    external: true

services:
  traefik:
    image: traefik:v2.2
    hostname: "{{.Node.Hostname}}-{{.Service.Name}}"
    command:
    - '--configFile=/etc/traefik/traefik.toml'
    networks:
      - traefik-public
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /traefik.toml:/etc/traefik/traefik.toml
      - /certificate:/certificate
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.http.routers.traefik-router.rule=Host(`traefik.${DOMAIN}`)
        - traefik.http.routers.traefik-router.entrypoints=websecure
        - traefik.http.routers.traefik-router.tls=true
        - traefik.http.routers.traefik-router.tls.certresolver=letsencrypt
        - traefik.http.routers.traefik-router.service=api@internal
        - traefik.http.middlewares.default-compress.compress=true
        - traefik.http.middlewares.default-https.chain.middlewares=default-compress
        - traefik.http.routers.traefik-router.middlewares=traefik-auth
        - traefik.http.middlewares.traefik-auth.basicauth.users=${ADMIN_USER?Variable ADMIN_USER not set}:${HASHED_PASSWORD?Variable HASHED_PASSWORD not set}
        - traefik.http.services.traefik-services.loadbalancer.server.port=8080

This is my conf.toml

################################################################
# Global configuration
################################################################
[global]
  checkNewVersion = true
  sendAnonymousUsage = false

################################################################
# Entrypoints configuration
################################################################

# Entrypoints definition
#
# Optional
# Default:
[entryPoints]
  [entryPoints.web]
    address = ":80"
    [entryPoints.web.http]
    [entryPoints.web.http.redirections]
      [entryPoints.web.http.redirections.entryPoint]
        to = "websecure"
        scheme = "https"
        permanent = true
  [entryPoints.websecure]
    address = ":443"
    [entryPoints.websecure.http.tls]
      certResolver = "letsencrypt"

################################################################
# Traefik logs configuration
################################################################
[log]
  level = "DEBUG"
  format = "json"

################################################################
# API and dashboard configuration
################################################################
[api]
  insecure = false
  dashboard = true

################################################################
# ACME configuration
################################################################
[certificatesResolvers.letsencrypt.acme]
  #caServer = "https://acme-v02.api.letsencrypt.org/directory"
  caServer = "https://acme-staging-v02.api.letsencrypt.org/directory"
  email = "${EMAIL}"
  storage = "/certificate/acme/acme.json"
  [certificatesResolvers.letsencrypt.acme.tlsChallenge]
  #[certificatesResolvers.letsencrypt.acme.httpChallenge]
  # entryPoint = "web"

################################################################
# Docker configuration backend
################################################################

# Enable Docker configuration backend
[providers.docker]
  endpoint = "unix:///var/run/docker.sock"
  swarmMode = true
  network = "traefik-public"
  watch = true
  exposedByDefault = false

[Screenshot: Traefik dashboard]

I’m not sure about the label traefik.http.services.traefik-services.loadbalancer.server.port=443. I’m also not sure how Let’s Encrypt connects. My guess is that the Let’s Encrypt server connects to host port 443 on a Swarm node, which is mapped to Traefik’s port 443 by the 443:443 binding.

These are my Traefik logs:

{"level":"debug","msg":"legolog: [INFO] [traefik.demo.cloud.patrowl.io] acme: Trying to solve TLS-ALPN-01","time":"2020-04-07T17:34:25Z"}
{"level":"debug","msg":"TLS Challenge CleanUp temp certificate for traefik.demo.cloud.patrowl.io","providerName":"acme","time":"2020-04-07T17:34:29Z"}
{"level":"debug","msg":"legolog: [INFO] Deactivating auth: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/47989242","time":"2020-04-07T17:34:29Z"}
{"level":"debug","msg":"legolog: [INFO] Unable to deactivate the authorization: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/47989242","time":"2020-04-07T17:34:29Z"}
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"traefik.demo.cloud.patrowl.io\": unable to generate a certificate for the domains [traefik.demo.cloud.patrowl.io]: acme: Error -\u003e One or more domains had a problem:\n[traefik.demo.cloud.patrowl.io] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: During secondary validation: Incorrect validation certificate for tls-alpn-01 challenge. Requested traefik.demo.cloud.patrowl.io from 51.91.60.234:443. Received 1 certificate(s), first certificate had names \"ec5552cec6a19446c4eaf94ddd866262.82c4629185c6e4458ce087bec5fef363.traefik.default, traefik default cert\", url: \n","providerName":"letsencrypt.acme","routerName":"traefik-router@docker","rule":"Host(`traefik.demo.cloud.patrowl.io`)","time":"2020-04-07T17:34:29Z"}
{"level":"debug","msg":"Serving default certificate for request: \"traefik.demo.cloud.patrowl.io\"","time":"2020-04-07T17:34:30Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.0.0.2:21000: remote error: tls: bad certificate","time":"2020-04-07T17:34:30Z"}

I searched the Let’s Encrypt forum for similar issues but didn’t find any useful information about this error:

acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: During secondary validation: Incorrect validation certificate for tls-alpn-01 challenge. Requested traefik.demo.cloud.patrowl.io from 51.91.60.234:443. Received 1 certificate(s), first certificate had names \"ec5552cec6a19446c4eaf94ddd866262.82c4629185c6e4458ce087bec5fef363.traefik.default

[Screenshot: traefik-service-443]
I’m not sure about this config either.

To sum up:

I have no problem with Traefik’s routers, middlewares, and services, but I can’t figure out ACME with the TLS challenge. I always get the default cert.
The URL where my server can be reached is in the logs. That’s not important, because it’s a demo instance protected by HTTP basic auth. I also tried without the auth, but I think Traefik lets Let’s Encrypt through anyway.
If you have any clue, please feel free to answer.

Thank you, and good luck during confinement.

Hi @BDO

If you get that error, then the primary Let’s Encrypt servers are able to validate your domain, but the validation servers in other networks are not.

Read up on multi-perspective validation to learn more about it.

But it’s possible this isn’t a “multi validation” problem (blocked IP addresses) and is instead a problem with your configuration.

Hi @JuergenAuer,
Thanks for your reply.

I have read the Let’s Encrypt topic about multi validation.

Is it possible that the first server (Let’s Encrypt data center) reaches the load balancer and the first Traefik instance, and then another server (cloud perspective) also reaches the load balancer but is sent to a different Traefik instance, so the challenge fails?

I have a Traefik v1.7 in production, also using ACME v2 and with the same load balancer config in front, and certificate generation works there with the TLS challenge.
Do you think I could run a test by whitelisting my domain from multi validation? https://forms.gle/9QN7dxALJVAoRjMKA

cf: https://letsencrypt.org/2020/02/19/multi-perspective-validation.html

Thanks again !

BDO

I would guess the problem is that you are not clustering Traefik: https://docs.traefik.io/v1.7/user-guide/cluster/

When the TLS-ALPN validation is being performed, there are going to be 4 challenge requests (due to multi-VA, as explained above).

Each of those 4 requests is going to arrive at any of the Traefik servers with a 1/3 chance (due to your round-robin proxying on port 443).

For this to succeed, every single one of the Traefik servers needs to know how to respond to the TLS-ALPN challenge.

Per your configuration, from what I can tell, only one of the Traefik servers - the one that initiates the certificate issuance process - knows how to respond. The other 2 servers are going to respond with the default certificate, because they have no idea about the certificate issuance request initiated by that 1 other Traefik instance.
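
To put a rough number on this, here is a back-of-the-envelope estimate. It assumes (as a simplification, not a description of how the OVH load balancer or Let’s Encrypt actually behave) that each of the 4 validation requests independently lands on one of the 3 Traefik instances with equal probability, and that all 4 must hit the instance that started the issuance:

```latex
P(\text{all 4 requests reach the initiating instance})
  = \left(\frac{1}{3}\right)^{4}
  = \frac{1}{81}
  \approx 1.2\%
```

Under those assumptions almost every attempt would fail, which is consistent with the default certificate being served nearly every time.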

I may have missed something - maybe you have configured clustering with KV storage etc - but I don’t see it in the info you’ve provided so far.

I’m not a k8s/Traefik user, so forgive any egregious mistakes I’ve made in this post …

Hi @_az,

Yeah, I think that could be the issue. I think the Let’s Encrypt servers don’t all reach the same Traefik instance because of the load balancer in front. Traefik is deployed across the Docker Swarm cluster (one instance per node). To explain the Traefik behavior:
Each Traefik instance binds ports 443 and 80 on its Docker host, which means the traffic flows like this:

Datacenter let's encrypt -> Loadbalancer OVH (frontend 443) -> Backend Ovh (Round Robin with source_ip on the same host docker - docker swarm [1-3] - port 443) -> Docker swarm [1-3] - Traefik services [1-3] (listen on port 443 of the host docker) -> traefik container [1-3] (services port 443).

So I think the problem is with the OVH load balancer (not the Traefik configuration), because each Let’s Encrypt validation server opens a new session. My source_ip configuration keeps a given client on the same host once it has connected, but when a different server opens a new connection, the OVH load balancer can’t know that this connection has to be forwarded to the same host.

So in theory, one solution would be to recognize challenge requests from Let’s Encrypt and configure the OVH load balancer to forward them all to the same Docker Swarm node to validate the challenge. But in practice, I don’t know how to do this ^^

Do you have any idea how to do this?
Tell me if anything is unclear and I will explain it another way.

BDO

I’m not sure this is really viable. You would have to route based on the TLS ALPN extension in the ClientHello message, and while some servers are capable of this (like HAProxy), I don’t think it’s really something you want to get into doing. There’s also no guarantee that the same Traefik instance is going to initiate the ACME challenge every time.

I think the proper solution is that your Traefik state should be clustered, but it’s probably better to ask on the Traefik mailing list or issue tracker about the architecture required to do this - I don’t think there are many experts on this forum.

You could also consider using DNS validation which is independent of your Traefik state entirely … if you have a known set of domains.

Isn’t it possible to use another solution?

You can create a subdomain acme.demo.cloud.patrowl.io.

That subdomain uses a fixed ip, not a load balancer.

Then add static redirects

http://traefik.demo.cloud.patrowl.io/.well-known/acme-challenge/random-filename

to

http://acme.demo.cloud.patrowl.io/.well-known/acme-challenge/random-filename

Then run an ACME client on that acme subdomain and use HTTP validation.

Let’s Encrypt follows these redirects (if they point to port 80 or port 443).

So loadbalancing isn’t a problem because the validation is redirected to a static domain without loadbalancing.

Yeah, I’m actually reading up on the DNS challenge, which could be easier to implement.
I just wanted to understand why the HTTP challenge and TLS challenge failed every time ^^. Now I think I understand ;).
Do you know of any resources on automating the DNS challenge? Maybe some best practices?

Thanks

Oh, it’s OK: Traefik supports updating DNS providers automatically, so I will configure the DNS challenge in my Traefik setup. https://docs.traefik.io/user-guides/docker-compose/acme-dns/
Thank you again for helping me understand my mistakes.

BDO
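
For readers landing here later, the DNS challenge setup mentioned above might look roughly like this in the traefik.toml from the first post. This is only a sketch: it assumes the DNS zone is hosted at OVH (the thread never says where the DNS is hosted), so the provider value is a placeholder for whichever DNS provider actually applies. The other values are copied from the original resolver.

```toml
# Sketch only: the resolver from the first post, with tlsChallenge swapped for dnsChallenge.
[certificatesResolvers.letsencrypt.acme]
  caServer = "https://acme-staging-v02.api.letsencrypt.org/directory"
  email = "${EMAIL}"
  storage = "/certificate/acme/acme.json"
  [certificatesResolvers.letsencrypt.acme.dnsChallenge]
    # Assumption: DNS is hosted at OVH. Replace with the provider code of your DNS host.
    provider = "ovh"
    # Seconds to wait before asking the ACME server to check the TXT record.
    delayBeforeCheck = 0
```

The provider credentials are passed to the Traefik container as environment variables (for the OVH provider: OVH_ENDPOINT, OVH_APPLICATION_KEY, OVH_APPLICATION_SECRET and OVH_CONSUMER_KEY). Because validation happens through a DNS TXT record, it does not matter which Traefik instance behind the load balancer receives incoming connections.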
