Which zones is LE querying from?
You said they query from 4 different zones. Do you know where they are located? Might those 4 zones change?
Our service runs in Dallas, Sydney, Frankfurt, Washington, London, and Tokyo. If we check the propagation from all of those zones, will it increase the probability that the order finishes successfully?
YourTestZone.IBM.com
Do you have that kind of pull/push/clout?
Otherwise, the name will have to be longer than that.
The point is to keep it as short as possible (not entirely by the name itself, but by the number of DNS systems that are involved: one delegates to another, which delegates to another, and then to another...)
Actually, the domain can be any domain, even a brand new one just for this purpose: YourNewTestDNSZone.info
That could keep the delegation to a bare minimum.
To get to any branch on the DNS tree, you will always have to start climbing up from the root/trunk.
.
.com
.com.ibm
.com.ibm.us
.com.ibm.us.texas.
.com.ibm.us.texas.test
You get the picture?
Each one of those could be a completely different set of DNS servers.
All it takes is one set to be problematic and you may not get the results you're looking for (consistently).
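To make the climb from the root concrete, here is a minimal sketch that lists every ancestor name of an FQDN in the order a resolver would walk them. The function name `delegation_chain` is made up for illustration, and note that not every label is necessarily a separate zone with its own servers; this only lists the places where a delegation could exist.

```python
def delegation_chain(fqdn: str) -> list[str]:
    """Return every ancestor name of fqdn, from the root down to fqdn itself."""
    labels = [label for label in fqdn.rstrip(".").split(".") if label]
    chain = ["."]
    # Build names from the rightmost label leftwards:
    # "com.", "ibm.com.", "us.ibm.com.", ...
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


if __name__ == "__main__":
    for name in delegation_chain("test.texas.us.ibm.com"):
        print(name)
```

Each printed name is a point where a different set of DNS servers could take over, so a shorter chain means fewer places for things to go wrong.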
OK, but we provide the service to users, so they order certificates for various domains. In the case of IBM I know what my root domain is, but with other domains, how will I know where to start digging from?
For example, if a user orders a certificate for the domain a.b.c.d.e, which domain should I start with?
Will I have to search for the TXT record on all name servers all the way to e, and do that from all zones?
OK maybe I misunderstood.
If users are free to request certs for any FQDN, then I'm missing something.
I thought this was all going to stay within your test zone(.x.y.z.IBM.COM).
No, we provide a service to order certificates for any domain.
The IBM domain mentioned in this thread is used in our test environment. We have a test environment which continuously orders certificates and validates that everything works as expected. We register new domains every day in the Softlayer domain and order certificates for them. The next day we delete the previous day's domains and create new domains for the current day's tests.
Actually, I'm looking for a general solution that will work 100% of the time, no matter which domain it is.
In other words, we want to simulate the same DNS challenge test that LE performs, so that when we ask LE to validate the challenge we can be 100% sure the test will also pass on LE's side.
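For the "what exactly should we look for" part of simulating the check: per the ACME spec (RFC 8555, section 8.4), the TXT record lives at `_acme-challenge.<domain>` and its value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. A rough sketch follows; the helper names are made up, and how you obtain the key authorization depends on your ACME client.

```python
import base64
import hashlib


def challenge_record_name(domain: str) -> str:
    """Name of the TXT record that a dns-01 validation looks up."""
    # For a wildcard order the record sits on the base domain.
    if domain.startswith("*."):
        domain = domain[2:]
    return "_acme-challenge." + domain.rstrip(".") + "."


def dns01_txt_value(key_authorization: str) -> str:
    """Expected TXT value: base64url(SHA-256(key authorization)), no padding."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    # "token.thumbprint" is a placeholder, not a real key authorization.
    print(challenge_record_name("a.b.c.d.e"))
    print(dns01_txt_value("token.thumbprint"))
```

Your own pre-check would then compare what each resolver returns for that name against this value.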
So, any suggestions for a better solution?
Will digging for the TXT record from several zones solve it?
Will using the root domain's name server IPs in my DNS client also improve it?
You may want to base your testing client on unbound, which is the resolver library used by LE too. I have the impression that it is quite picky about the proper configuration of the DNS. Unbound has configuration parameters which can widely affect its behavior. You want to match those with LE too.
Try to do the DNS record verification from as many locations on the Internet as you can, to minimize DNS result mismatches with LE due to different anycast locations of the authoritative DNS server(s).
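A minimal sketch of what checking from several locations could look like. The actual DNS lookup is left as a callable you supply (`lookup_txt` is a hypothetical hook, for example a call to an agent running an unbound-based resolver in each of your regions), so this only shows the comparison logic.

```python
from typing import Callable, Dict, Iterable


def check_from_vantage_points(
    record_name: str,
    expected_value: str,
    vantage_points: Iterable[str],
    lookup_txt: Callable[[str, str], Iterable[str]],
) -> Dict[str, bool]:
    """Map each vantage point to whether it currently sees the expected TXT value."""
    results = {}
    for vantage in vantage_points:
        try:
            values = set(lookup_txt(vantage, record_name))
        except Exception:
            # Treat any lookup failure as "not visible from here".
            values = set()
        results[vantage] = expected_value in values
    return results


def all_agree(results: Dict[str, bool]) -> bool:
    """True only if at least one location was checked and all of them saw the value."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    # Fake lookup for demonstration: Sydney still serves a stale answer.
    def fake_lookup(vantage: str, name: str):
        return ["old-value"] if vantage == "Sydney" else ["new-value"]

    seen = check_from_vantage_points(
        "_acme-challenge.example.com.", "new-value",
        ["Dallas", "Sydney", "Frankfurt"], fake_lookup,
    )
    print(seen, all_agree(seen))
```

Agreement across your own locations still does not guarantee that LE's vantage points see the same thing; it only reduces the chance of a mismatch.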
I don't think it's too helpful to know that, but the overall current list of locations is something like: Viawest Utah, Viawest Colorado, AWS eu-central-1, AWS eu-east-2, AWS us-west-2.
Did you try to find out how long it takes for your DNS updates to become visible globally? I think it would be helpful to have that baseline.
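One rough way to collect that baseline: start a timer right after the DNS update and poll until the record is visible everywhere you can check. In this sketch `is_visible` is a hypothetical callable you provide (for instance, a check across all your regions), and the default timeout and interval are arbitrary.

```python
import time
from typing import Callable, Optional


def measure_propagation(
    is_visible: Callable[[], bool],
    timeout: float = 600.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until is_visible() first returns True, or None on timeout."""
    start = clock()
    while True:
        if is_visible():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Running this for each daily test domain and logging the results would show whether 40s is typical, borderline, or nowhere near enough.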
Lots of people in this thread (including myself) have jumped on the anycast-blaming bandwagon, but I'm still open to the idea that the cause could be something else. Like the way your zones are subdelegated back onto the same nameservers (which dnsviz complains about). Or maybe Softlayer has some anycast caching we're not aware of.
Trying to diagnose this would be better than immediately jumping to a solution based on guesswork. Even if the experiment is just seeing whether the problem goes away (and more importantly, if it does not) if you increase 40s→5m.
Having a DNS API which gives you consistency guarantees for changes (like Route53 does) would be something like what you want.
I don't think there's a foolproof generalized solution besides waiting and retrying, which random outages will make necessary anyhow.
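As a sketch of the wait-and-retry approach, something like the following could wrap whatever function places the order. `attempt` is a hypothetical callable returning True on success, and the delays are arbitrary starting points, not values recommended by LE.

```python
import time
from typing import Callable, Optional


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 10.0,
    max_delay: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call attempt() until it succeeds; return the attempt number, or None if all fail."""
    delay = base_delay
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
        if n < max_attempts:
            sleep(delay)
            # Double the wait each time, up to max_delay.
            delay = min(delay * 2, max_delay)
    return None
```

Keep the CA's rate limits on failed validations in mind when choosing `max_attempts`.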