while messing with this I just realized this is literally forging an ACME challenge reply out of band; it just happens to be on the real web server, and the middleman happens to run on the same box.
not sure how picky LE (and golang, as that's what makes the calls) is about oddities on the TCP layer, and how well we forged this challenge
btw can I ask why it used the `iptables` command if nfqueue (judging by the name) requires nftables to run?
From what I've observed, the `iptables` CLI is more ubiquitous than the `nft` CLI on most Linux distros, even though it is usually just a frontend into the nftables backend.
Certainly there are a number of equivalent methods of adding and removing the required rule, and a production version of such a plugin should probably support a couple of them, as well as checking for/loading the `nfnetlink_queue` module etc.
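For reference, a minimal sketch of what that rule management could look like from Python; the chain, queue number, and use of `subprocess` are my assumptions, not the plugin's actual code:

```python
import subprocess

# Hypothetical rule: divert inbound TCP/80 to NFQUEUE #1 so the userspace
# solver sees the challenge request before the webserver does.
RULE = ["INPUT", "-p", "tcp", "--dport", "80",
        "-j", "NFQUEUE", "--queue-num", "1"]

def add_rule():
    subprocess.run(["iptables", "-I", *RULE], check=True)

def remove_rule():
    subprocess.run(["iptables", "-D", *RULE], check=True)
```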
ported an equivalent solver for lego:
nfqueue dangling session tcp.pcapng (2.6 KB)
this makes the TCP session we steal a packet from fail to close, and it looks like it dangles for 2~3 minutes and creates a bunch of TCP retransmissions from the ACME server side. before releasing this we should close the TCP channel on both sides: I wonder if sending RST to both sides will be enough, or if we need something else? or just write another packet with RST+ACK?
sending RST to the server will make the server stop caring about this session, but our webserver still pings the ACME server. the VA doesn't care though
not sure how to write a packet back to the webserver though
Definitely it should take over the TCP session and do a graceful FIN/ACK for the connections from the ACME server.
At that point it may well be convenient to reuse the same logic for the connection to the local webserver, though I don't think it's a big deal to RST that one.
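A rough sketch of what that forged graceful close could look like, assuming scapy is used for packet crafting (the plugin may build packets differently):

```python
from scapy.all import IP, TCP, send  # assumption: scapy is available

def forge_finack(src, dst, sport, dport, seq, ack):
    # Gracefully close the hijacked session toward the ACME VA, using the
    # seq/ack we tracked while injecting the challenge response.
    pkt = IP(src=src, dst=dst) / TCP(sport=sport, dport=dport,
                                     flags="FA", seq=seq, ack=ack)
    send(pkt, verbose=False)
```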
So is the local webserver aware of this? I assumed these packets would never hit it, and there wouldn't be any connection to the local backend.
The takeover only kicks in when the HTTP request comes in, so the local webserver is aware of an open TCP connection without any data sent yet.
when we catch the ACME request there is already a connection between the ACME VA and our backend, because until the TCP handshake finishes and payload starts flowing there is no way for the firewall to know this is for ACME. when we inject the http reply we advance the seq of that session by len(reply), which our main webserver doesn't know happened, so both sides (unless we send RST and close the session) keep retransmitting but can't talk because seq/ack no longer match.
not sure how to inject a packet to our backend with a forged src address though
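To illustrate the desync with made-up numbers:

```python
# Illustrative numbers only; not the plugin's actual state tracking.
backend_seq = 1000                       # backend's next send SEQ
reply = b"HTTP/1.1 200 OK\r\n\r\n" + b"<token>.<thumbprint>"
va_acked = backend_seq + len(reply)      # VA has now ACKed up to here
# The backend never sent `reply`, so it still sends from SEQ 1000; the VA
# treats those segments as stale, and the backend sees ACKs for data it
# never sent, so both sides retransmit until one of them gives up.
```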
Ah, that makes sense.
IIRC, closing this should not matter much on Nginx (it can handle many slow/orphaned connections), but it can seriously degrade performance on Apache, so closing would be needed there. I'm not sure about other platforms.
After playing with this for a while, it does seem to be a problem.
- Can't forge FIN/RST towards the backend with raw packets, because that bypasses the kernel's network stack. The kernel ends up thinking the connections are still `ESTAB` and things like `epoll` then don't work properly.
- Been trying to use nfqueue's mangle, but either I am screwing up the packet and it's being ignored, or it's too late in the networking stack processing pipeline and it's not actually possible to alter the connection state at that point. It's probably possible to directly delete the connection state using `libnetfilter_conntrack`, but then the hackery involved is getting wildly out of control.
At least, forging FIN/ACK to the ACME servers seems to work, but ofc retransmits still come from Linux.
This is what I've been trying.
This may not be relevant, but I am bringing it up just in case. Many years ago, I ran into the issue of Python's `requests` library not being able to give me actual information about the connection, which caused a lot of blockages/issues in troubleshooting. I eventually realized the cause of our problems was domains that had multiple DNS records, and we had no way to determine what IP address we connected to (our issue) OR what their SSL certificate was (another group's issue that was essentially the same as ours, and we eventually needed).
The underlying reason for this was the manner in which `requests` utilizes `urllib3`, and that `urllib3` closes the socket connection without logging any info or offering hooks to capture data. Suggested "workarounds" all involved a second connection, which is not guaranteed to be similar to the first. We eventually found a workaround technique for persisting IP data, but could not persist the SSL certificate data without a fork or monkeypatch. `urllib3` and `requests` are open to a new debug object, but no one involved had enough time to fully spec this out and get enough consensus to generate a PR that would be accepted.
Anyways, my suggestion is to check the `fnfqueue` source to see if they are closing something or just not persisting some variable or connection. There might also be an opportunity for a new hook.
This was the issue and it's fixed now. Probably the `seq` did not add up when I stripped the payload. If I set only the `RST` flag on the inbound packet and mangle it, everything works OK and the connection gets immediately closed. nginx (or whoever runs on port 80) sees a connection reset, even via `epoll`.
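A minimal sketch of that mangle for the IPv4 case (checksum fixup omitted; the real code presumably recomputes it):

```python
def set_rst(raw: bytes) -> bytes:
    # Flip the TCP flags byte to RST only, leaving seq/ack and payload
    # untouched (stripping the payload is what broke the seq accounting).
    ihl = (raw[0] & 0x0F) * 4    # IPv4 header length in bytes
    pkt = bytearray(raw)
    pkt[ihl + 13] = 0x04         # TCP flags live at offset 13; 0x04 = RST
    # NOTE: the TCP checksum at offset ihl + 16 must be recomputed over
    # the pseudo-header before handing the packet back to nfqueue.
    return bytes(pkt)
```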
It can be done "more properly" but I'm happy for now, no more rogue retransmissions.
Should work the same in the Go `nfqueue` library I think: mangle the packet to add the `RST` flag.
So what remains is:
- Check whether IPv6 support needs any changes
- UX around having the right netfilter module loaded and nfqueue library installed
- ...
for ipv6: `hw_protocol` will hold the Ethernet frame protocol info, 0x0800 for IPv4 and 0x86DD for IPv6
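Something along these lines, with the EtherType constants from if_ether.h (the exact attribute name depends on the nfqueue binding):

```python
ETH_P_IP = 0x0800    # IPv4 EtherType
ETH_P_IPV6 = 0x86DD  # IPv6 EtherType

def ip_version(hw_protocol: int) -> int:
    # Dispatch the packet parser on the EtherType reported in the
    # queue metadata rather than sniffing the IP version nibble.
    if hw_protocol == ETH_P_IP:
        return 4
    if hw_protocol == ETH_P_IPV6:
        return 6
    raise ValueError(f"unexpected hw_protocol {hw_protocol:#06x}")
```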
I kinda feel like it should be fail-safe: even when certbot is killed before cleanup is called, we should ensure we don't leave the firewall rule on and block the webserver. if the client is killed but the nfq rule is still there, then all traffic to port 80 will be dropped
looks like nfqueue has `--queue-bypass`, which lets packets pass through if nothing is listening on the queue. we should add this, but a duplicate rule will still mess up the next renewal, as we will send two replies.
changing the token to zeros when we pass with RST would work though
edit: adding `--queue-bypass` makes it not send any packets to us hmmm
edit2: it was my code taking too long to process; optimizing it fixed that
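Assuming the hypothetical iptables-style rule from earlier, the bypass flag just gets appended:

```python
# Same hypothetical rule as before, now with --queue-bypass so port 80
# traffic passes through normally whenever nothing is bound to the queue.
RULE = ["INPUT", "-p", "tcp", "--dport", "80",
        "-j", "NFQUEUE", "--queue-num", "1", "--queue-bypass"]
```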
Ah, that's very cool and worth using. Nice find. Works for me.
I also found this note:
> This feature is broken from kernel 3.10 to 3.12: when using a recent iptables, passing the option `--queue-bypass` has no effect on these kernels.
what should it do when there is no webserver running on that port? as is, the challenge will fail because the kernel will send RST, so we never get the http traffic. there are 3ish options
- just let the challenge fail
- test port binding and fail with message to use normal standalone
- we call normal standalone mode solver
- we bind port 80 ourselves so something is listening (but this sounds really roundabout way)
I like this one because it catches users who might be holding the plugin wrong, early on. I've applied it.
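A sketch of that early check, assuming the "test port binding" option: if we can bind the port ourselves, nothing is listening behind the firewall rule (names here are mine, not the plugin's):

```python
import errno
import socket

def ensure_backend_listening(port: int = 80) -> None:
    # If the bind succeeds, no webserver holds the port, so the challenge
    # would fail with a kernel RST before we ever see HTTP traffic.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", port))
    except OSError as e:
        if e.errno == errno.EADDRINUSE:
            return  # something is listening, carry on
        raise
    raise RuntimeError(
        f"nothing is listening on port {port}; "
        f"use the normal standalone authenticator instead")
```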
I also replaced the `iptables` invocation with pure Python `netlink` code. Right now I think it only depends on the kernel module, not on any C libraries (other than libc). But time will tell when I try testing on some older distros.
Hm, looks complicated
I don't see the variable `port` being used in the expression, except earlier for removing the table; is that normal? I don't see 80 (or 0x50) anywhere for that matter. Maybe I'm blind.
Oops, nice catch. `b"\x00P"` is `0x0050`. The API takes `bytes` for some reason, I don't know. Just virtual machine things. Fixed!
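A quick worked example of that encoding:

```python
import struct

port = 80
payload = struct.pack("!H", port)  # big-endian (network order) uint16
assert payload == b"\x00P"         # 0x50 is ASCII "P", hence the odd repr
assert int.from_bytes(payload, "big") == 0x0050
```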