Cannot connect to acme-v02.api.letsencrypt.org from web server

Yesterday my organization renewed our certificates using wacs.exe, through which we were able to connect to acme-v2.api.letsencrypt.org without issue. Today, that is not the case. We have been through every similar post I could find but they were either closed without resolution or did not resolve our issue. Please shed any light you can on this, we are currently at a loss as to what is going on.

My domain is: (www.)itmmarketing.com (we have multiple subdomains, all with valid certificates)

I ran this command: ping, tracert, openssl s_client -connect, wacs.exe, opened in browser

It produced this output:

  • tracert:

1 * * * Request timed out.
etc.

  • openssl s_client -connect:

12160:error:0200274C:system library:connect:reason(1868):crypto\bio\b_sock2.c:110:
12160:error:2008A067:BIO routines:BIO_connect:connect error:crypto\bio\b_sock2.c:111:
connect:errno=0

  • wacs.exe: (screenshot attached, not reproduced here)
  • opened in browser: ERR_CONNECTION_TIMED_OUT

No such errors occur when using acme-staging-v02.api.letsencrypt.org.
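
For reference, the checks above were run roughly like this (exact invocations reconstructed from memory, so treat them as approximate):

ping acme-v02.api.letsencrypt.org
tracert acme-v02.api.letsencrypt.org
openssl s_client -connect acme-v02.api.letsencrypt.org:443
ping acme-staging-v02.api.letsencrypt.org
tracert acme-staging-v02.api.letsencrypt.org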

My web server is (include version): IIS Version 10.0.14393.0

The operating system my web server runs on is (include version): Microsoft Windows Server 2016 Version 1607

My hosting provider, if applicable, is: AWS

I can login to a root shell on my machine (yes or no, or I don't know): Yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): No

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): Windows ACMEv2 client version 2.1.18.1119

That's interesting; it seems like your packets are getting dropped somewhere along the way. Are you able to successfully ping other hosts from the same machine? Did anything change about your network topology, firewall, or routing recently? Do you have a firewall that restricts egress?
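
For example, something along these lines from the affected machine (PowerShell; the two hostnames are just the production and staging endpoints) would at least show whether DNS resolution and the TCP connection to port 443 succeed for each:

Test-NetConnection acme-v02.api.letsencrypt.org -Port 443
Test-NetConnection acme-staging-v02.api.letsencrypt.org -Port 443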

1 Like

Yes, we can ping other hosts from the same machine, including acme-staging-v02.api.letsencrypt.org. Nothing has changed since yesterday, when we were able to successfully connect. We do not restrict egress; all outbound traffic is allowed.

Actually, I'm having the same problem, but only from some machines and/or networks, not all of them. I believe this points to a problem on the other side of the magic Cloudflare performs. I verified via netstat and firewall logs that TCP sessions are being created, and after waiting a very long time, a browser attempting to load https://acme-v02.api.letsencrypt.org/directory from one of the problem machines returned this:

The website’s security certificate is not yet valid or has expired.
Error Code: DLG_FLAGS_SEC_CERT_DATE_INVALID

I had problems similar to this with certs renewed on 9/28, and they automatically renewed again on 9/29, resolving the problems. I suspect at least one endpoint that the Cloudflare URL forwards to has a certificate with an invalid chain due to the expiration discussed here:

2 Likes

I'm investigating this now.

Edit: All staging/prod load balancers are serving the following chain; the example below is from prod, as shown by the CN.

Certificate chain
 0 s:CN = acme-v02.api.letsencrypt.org
   i:C = US, O = Let's Encrypt, CN = R3
 1 s:C = US, O = Let's Encrypt, CN = R3
   i:C = US, O = Internet Security Research Group, CN = ISRG Root X1
---
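
(For anyone who wants to compare from their own vantage point: that listing is just what a plain s_client check prints, so something like the following should show the same chain.)

openssl s_client -connect acme-v02.api.letsencrypt.org:443 -servername acme-v02.api.letsencrypt.org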

Is it possible that we are being blocked for some reason, similar to this post? I can't imagine why, but we are at a total loss as to what else the problem could be at this point.

PM me IP addresses and I will see if there's anything in the Cloudflare block list.

Edit: The IP address was not found in the block list.

3 Likes

@bb_dev welcome to the LE community forum :slight_smile:

Have you ruled out local firewall and antivirus?

1 Like

I believe so. We were able to connect without issue on Thursday and nothing has changed in either. If there is any additional information I can provide to help troubleshoot this, please let me know. We still have no clue what the problem is at this time.

In my case, it's starting to look like a firewall issue, but I'm not sure why allowing the IP for acme-v02.api.letsencrypt.org isn't sufficient, nor am I sure why that was sufficient prior to the end of last week (the certificate updated fine on 9/30, but the client started failing to connect on 10/1). Nothing relevant has changed in my firewall config, but a server with more permissive access through the same egress point is not demonstrating the same behavior. In the firewall logs, I can see several blocked attempts to access the Akamai CDN, but I don't know if those are from the client or just normal Windows stuff. Did LE just move to Cloudflare? That might explain why I'm seeing the wrong certificate if only allowing the LE IP is no longer permissive enough.

What does that mean exactly?

I was just editing to clarify, but I'll reply here instead since that will be less confusing now. The firewall in question is a FortiGate. The FortiGate is configured to allow all traffic destined for the FQDN acme-v02.api.letsencrypt.org. In theory, this should allow for changes in that IP, but only via updates to DNS. So far, the IP has been consistent when I have performed nslookup from various endpoints. That IP appears to be a Cloudflare IP, and it is the only IP this server is allowed outbound access to. This was working fine on 9/30, but doesn't seem to be working anymore since 10/1. I have no idea whether or not the DNS pointed to a Cloudflare IP on 9/30.
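
In case it's useful for comparison, the resolution check I've been doing is just nslookup against the default resolver and against a public one (the 8.8.8.8 below is only an example resolver):

nslookup acme-v02.api.letsencrypt.org
nslookup acme-v02.api.letsencrypt.org 8.8.8.8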

1 Like

Thank you for this link, but we are not using SSL inspection.

I don't think the firewall is the issue. I changed the firewall path in question to reject packets instead of dropping them, and now the "invalid certificate" warning comes up much sooner. IE won't let me look at the certificate, so I found an old portable copy of Firefox and tried it. The first certificate rejected was an LE certificate, valid from 9/29 to 12/28, with CN = acme-v02.api.letsencrypt.org and serial number 03:89:47:BF:CA:58:B2:C9:C8:83:7B:31:2B:70:72:88:12:2E. The R3 certificate is shown at the top of the chain in this browser's interface. Said interface is also showing these messages:

Certificate Status
This certificate attempts to identify itself with invalid information.

Unknown Identity
Certificate is not trusted, because it hasn't been verified by a recognized authority.

I exported the certificate from the browser and viewed it with Windows; it shows the "DST Root CA X3" certificate as the problem. This is the same issue my servers had before renewing a second time. This is the expired root certificate.
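
(If anyone wants to reproduce this, running certutil against the exported file should show the same chain evaluation Windows is doing; the filename below is just an example for wherever the export was saved:)

certutil -verify exported-acme-v02.cer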

When I tried to go through this in the browser again, I had the same issues with another certificate. This certificate shows serial number 03:54:66:A2:2D:FB:A4:DC:6F:4B:8F:BF:9D:72:DD:44:BB:CE.

Granted, the old version of Firefox has its own old certificate store and isn't using the Windows store, and it does seem odd that one machine would consistently get bad certificates while the other consistently would not, so maybe this is a configuration issue on my end? Could this comment be relevant? I think I took that action on the machine that is working, but I think it was working before I did (and I still haven't rebooted it). Also, I just tried it on the machine that isn't working (except the reboot part), and that machine continues to exhibit the same behavior.

Separately, while the Fortinet document linked above doesn't apply to me, it does imply that LE has done something wrong. Specifically:

Let’s Encrypt took the additional step of cross signing their root CA into the chain of trust. In doing so, they signed the cross-signed root for longer than the lifespan of the IdentTrust DST Root CA. Their intention was for it to remain valid after the Signing Root CA expired. But as described by Scott Helme, it’s a “sneaky move but it does seem to fall within the rules.”

It goes on to talk about why that works for Android and not Fortinet, but the reason it doesn't work for Fortinet may also apply to Windows (except that doesn't explain why my other machine is having no issue).

Have you run wacs.exe with --verbose? I see this when I do:

System.Net.Http.WinHttpException (80072F8F, 12175): Error 12175 calling WINHTTP_CALLBACK_STATUS_REQUEST_ERROR, 'A security error occurred'.

I think this supports the theory that the wrong certificate chain is being followed. I suspect it's a Windows thing, and I'm still trying to figure out exactly what Windows thing.
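
One quick check, in case the problem is simply a missing or outdated root on that machine: whether the trusted root store contains ISRG Root X1 at all. Something like this should show it (store name and filter as I understand them):

certutil -store Root | findstr /i "ISRG"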

1 Like

This is the output I get when I run wacs.exe with --verbose:

[EROR] Initial connection failed, retrying with TLS 1.2 forced
System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 10 seconds elapsing.
---> System.TimeoutException: A task was canceled.
---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
at PKISharp.WACS.Services.ProxyService.LoggingHttpClientHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.SendAsyncCore(HttpRequestMessage request, HttpCompletionOption completionOption, Boolean async, Boolean emitTelemetryStartStop, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
--- End of inner exception stack trace ---
at System.Net.Http.HttpClient.SendAsyncCore(HttpRequestMessage request, HttpCompletionOption completionOption, Boolean async, Boolean emitTelemetryStartStop, CancellationToken cancellationToken)
at PKISharp.WACS.Clients.Acme.AcmeClient.CheckNetwork()
[DBUG] Send GET request to https://acme-v02.api.letsencrypt.org/directory
[EROR] Unable to connect to ACME server
System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 10 seconds elapsing.
---> System.TimeoutException: A task was canceled.
---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
at PKISharp.WACS.Services.ProxyService.LoggingHttpClientHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.SendAsyncCore(HttpRequestMessage request, HttpCompletionOption completionOption, Boolean async, Boolean emitTelemetryStartStop, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
--- End of inner exception stack trace ---
at System.Net.Http.HttpClient.SendAsyncCore(HttpRequestMessage request, HttpCompletionOption completionOption, Boolean async, Boolean emitTelemetryStartStop, CancellationToken cancellationToken)
at PKISharp.WACS.Clients.Acme.AcmeClient.CheckNetwork()

Based on that output, yours looks like a timeout; mine wasn't. I was able to resolve mine, and I'm going to document that here, but I'm afraid it's not going to help you. That having been said, I notice your tracert shows packets being dropped at the very first hop. Are you able to successfully tracert other addresses from that machine?

Regarding my issue, once I suspected that it had to do with a stale certificate store on the Windows server (the store would be stale because the server has no Internet access), and after the Windows updates and reboots from the comment I linked to in my last post didn't help, I proceeded to try to figure out how to get the cert store updated manually. I found this page and tried to follow the instructions under the "The List of Root Certificates in STL Format" header. Naturally, there was no "Install" option on the right-click, but the certutil command worked. Unfortunately, that file didn't include the ISRG certificate. I was able to get things working because I have another server that is working, so I exported the ISRG and R3 certificates from it and then imported them onto the problem server. This solved my problem. I should imagine there is a better way to do that, and it is likely documented. If so, it would be great if someone linked it here. Meanwhile, I'm up and running again.
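
For anyone who hits the same thing later, an equivalent way to do that import from an elevated command prompt would be something like this (the filenames are just examples for the two exported certificates; I can't say this is the officially recommended procedure):

certutil -addstore Root isrg-root-x1.cer
certutil -addstore CA lets-encrypt-r3.cer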

2 Likes

Yes, the server can successfully tracert other addresses, including acme-staging-v02.api.letsencrypt.org.

If the first numbered line of tracert for acme-v02.api.letsencrypt.org is

1 * * * Request timed out

but the first numbered line of tracert for acme-staging-v02.api.letsencrypt.org is more like

1 #ms #ms #ms <fqdn or ip of first hop>

then your problem is at or before the first hop, and that's where you need to be looking for it. The only explanations I can think of for that scenario would be something in your router/firewall explicitly dropping that destination, or a bad route in your server's routing table (under normal circumstances, I mean). I'm assuming you haven't configured a local interface to use the LE IP and drop ICMP packets, and that your hosts file isn't mangled in such a way that the IP is being translated into something else that's getting dropped. Regardless, additional tracert output may be necessary for further troubleshooting, and it may be wise to post a parallel inquiry about this discrepancy on a site where networking is the focus.
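
If it helps, the kinds of checks I'd run on that machine to rule those out would be along these lines (the hosts file path is the standard Windows location):

route print
type C:\Windows\System32\drivers\etc\hosts
tracert -d acme-v02.api.letsencrypt.org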

1 Like