Certbot works on staging but fails with prod


#1

My domain is: sonar.digital.arcadiagroup.co.uk

I ran this command:
certbot certonly --standalone --preferred-challenges http --email vaild@email.co.uk --agree-tos --no-eff-email --http-01-port 888 --post-hook='service hitch reload' --renew-hook='/usr/local/bin/hitch-renew-hook' -d sonar.digital.arcadiagroup.co.uk

certbot certonly --standalone --preferred-challenges http --email vaild@email.co.uk --agree-tos --no-eff-email --http-01-port 888 --post-hook='service hitch reload' --renew-hook='/usr/local/bin/hitch-renew-hook' -d sonar.digital.arcadiagroup.co.uk --staging (works fine)

It produced this output:
Failed authorization procedure. sonar.digital.arcadiagroup.co.uk (http-01): urn:ietf:params:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://sonar.digital.arcadiagroup.co.uk/.well-known/acme-challenge/NcIzxQbaRJwBW5Iyl6cbPYAibIAyz112AdSwhvpBY6k: Timeout during connect (likely firewall problem)

My web server is (include version): hitch 1.4.8

The operating system my web server runs on is (include version): AWS Linux 2

My hosting provider, if applicable, is: AWS

I can login to a root shell on my machine (yes or no, or I don’t know): yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel): no

The staging version works fine. When I try to get a production cert, it fails with the above error and I can’t see any request from acme-v02.api.letsencrypt.org; while it’s running, no inbound requests are logged.

Any idea why the production API endpoint can’t reach the domain?


#2

Did you see incoming requests during the staging validation? Neither staging nor production will actually make a validation request if it already has a valid authorization cached for the name and account you’re using. So it may be that your server is just unreachable in general.


#3

Also, you’re probably aware of this (since you obviously completed at least one staging validation successfully), but just in case: --http-01-port only tells certbot which port to listen on. It doesn’t affect the behaviour of the CA, which always connects on port 80, so you still need to ensure that the relevant requests get forwarded to the standalone certbot web server.


#4

I can see the staging request just fine; there is nothing for production.


#5

Any idea what I can do to debug this?

From the looks of it, this seems to be an issue with the production Let’s Encrypt endpoint, since it works as expected with staging.

Is there any reason why production would not make a request to the auth origin?


#6

The different behaviour between staging and production is odd.

But I may have a hint:

Your DNS says:

sonar.digital.arcadiagroup.co.uk. 299 IN A 63.32.31.128
sonar.digital.arcadiagroup.co.uk. 299 IN A 63.32.12.54
sonar.digital.arcadiagroup.co.uk. 299 IN A 34.243.206.86

But it seems only 63.32.12.54 is valid; can you confirm?
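
One way to confirm which of the three A records actually accepts connections on port 80 is to try a plain TCP connect to each address in turn. Below is a minimal Python sketch of that check. The function names `resolve_a_records` and `probe_tcp` are made up for illustration, and the 5-second timeout is an arbitrary assumption.

```python
import socket
import time


def resolve_a_records(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def probe_tcp(ip, port=80, timeout=5.0):
    """Attempt a TCP connect; return (reachable, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        reachable = False
    return reachable, time.monotonic() - start


if __name__ == "__main__":
    host = "sonar.digital.arcadiagroup.co.uk"
    for ip in resolve_a_records(host):
        ok, elapsed = probe_tcp(ip)
        state = "connected" if ok else "no connection"
        print(f"{ip}: {state} after {elapsed:.2f}s")
```

An address that only ever reports "no connection" close to the timeout value would fit the "Timeout during connect" error from the CA, since the validation server may pick any of the published addresses.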


#7

Hi @jpd4ag

your configuration looks buggy.

Checking your root

D:\temp>download http://sonar.digital.arcadiagroup.co.uk/ -h
Error (1): The remote server returned an error: (503) Server Unavailable.
ProtocolError
Retry-After: 5
X-Varnish: 56244
Age: 0
Connection: keep-alive
Content-Length: 282
Content-Type: text/html; charset=utf-8
Date: Wed, 31 Oct 2018 23:39:34 GMT
Server: Varnish
Via: 1.1 varnish-v4

Status: 503 ServiceUnavailable
503

21156,11 milliseconds
21,16 seconds

That is an HTTP 503 status.

But checking the file Let’s Encrypt wants to get:

D:\temp>download http://sonar.digital.arcadiagroup.co.uk/.well-known/acme-challenge/NcIzxQbaRJwBW5Iyl6cbPYAibIAyz112AdSwhvpBY6k -h
Error (1): The underlying connection was closed: The connection was closed unexpectedly…
ConnectionClosed
3

21295,44 milliseconds
21,30 seconds

A “ConnectionClosed” error (transport level). The first try took 0.24 seconds, the second more than 20 seconds. It looks like one instance times out, so the next instance answers.
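
To see which backend address produces which result, the same request can be sent to each IP individually while keeping the original Host header. The following is a rough Python sketch under that assumption, not the tool used above. `fetch_via_ip` is a hypothetical helper name, the IPs are the three A records quoted earlier in the thread, and the 10-second timeout is an arbitrary choice.

```python
import http.client
import socket
import time


def fetch_via_ip(ip, host, path, port=80, timeout=10.0):
    """Request http://host/path from one specific IP; return (outcome, seconds)."""
    start = time.monotonic()
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Supplying Host ourselves keeps the virtual host name while
        # the TCP connection goes to the chosen IP.
        conn.request("GET", path, headers={"Host": host})
        outcome = f"HTTP {conn.getresponse().status}"
    except socket.timeout:
        outcome = "timeout"
    except (http.client.HTTPException, OSError) as exc:
        # e.g. RemoteDisconnected when the peer closes without answering
        outcome = f"connection error: {type(exc).__name__}"
    finally:
        conn.close()
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    host = "sonar.digital.arcadiagroup.co.uk"
    path = "/.well-known/acme-challenge/NcIzxQbaRJwBW5Iyl6cbPYAibIAyz112AdSwhvpBY6k"
    for ip in ("63.32.31.128", "63.32.12.54", "34.243.206.86"):
        outcome, elapsed = fetch_via_ip(ip, host, path)
        print(f"{ip}: {outcome} in {elapsed:.2f}s")
```

If one address answers quickly and the others only time out, that would match the pattern of a fast first try followed by a 20-second second try.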


#8

The URL will time out because certbot is not running, so it will return a 503. If I put up a mock web server to fake the response, it works fine.

I currently have a single server behind an AWS NLB, which provides the fixed IPs.
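
For anyone wanting to repeat the mock web server test, something along these lines would do. This is only a sketch for checking that requests arriving on port 80 get forwarded to port 888 on the instance. It cannot satisfy a real ACME validation, because `TEST_TOKEN` and `TEST_BODY` are hypothetical placeholders rather than a token and key authorization issued by the CA. The class and function names are also made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"
TEST_TOKEN = "connectivity-test"   # hypothetical placeholder, not a real ACME token
TEST_BODY = b"forwarding-ok\n"     # hypothetical placeholder response body


class ChallengeMockHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == CHALLENGE_PREFIX + TEST_TOKEN:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(TEST_BODY)))
            self.end_headers()
            self.wfile.write(TEST_BODY)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Print every request with the peer address as seen by this server,
        # so it is obvious whether anything reached the box at all.
        print(f"{self.client_address[0]} {format % args}")


def make_server(port=888, bind="0.0.0.0"):
    """Build the mock server; 888 matches --http-01-port in the original command."""
    return HTTPServer((bind, port), ChallengeMockHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Binding to port 888 needs root on most Linux systems. With this running, fetching http://sonar.digital.arcadiagroup.co.uk/.well-known/acme-challenge/connectivity-test from outside the network should show up as a printed request line if the forwarding from port 80 works.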