Is there any way to bring down the LE Staging environment for a few seconds (or some fixed period) to test transient error handling in our service?
We tried Pebble, which I can bring down manually for testing, but the problem I ran into is that Pebble keeps no state across restarts: it holds no data about the order I placed before I brought it down, so it cannot continue processing that order.
I would recommend resolving the staging API hostname to a specific (private) IP address and managing the routing, as you recommended earlier, on that specific IP address. That way, other services can make use of Cloudflare without being affected by the tests for staging.
That's not true, not in my case anyway. I can add hosts to /etc/hosts and name resolution quickly switches to the new IP addresses. Perhaps my local DNS cache lifetime is very short, I dunno. YMMV.
Sounds like you are good at finding unintended features of the LE Staging environment.
Not a bad skill to have, but remember: "with great power comes greater responsibility".
What I'd do is put a proxy in front of Pebble and configure it to return 500s when you want it to. A simple way would be, when you want it to be 'down', put the wrong address in the proxy field so it doesn't actually reach Pebble. Reconfiguring or restarting the proxy won't reset Pebble.
I would make a deny rule (above my accept rule) that kicks in only during specific hours of the day (or on certain days of the week).
That way you can predetermine and structurally schedule all your random outages - LOL
In this "example", we can see how access to all defined LE networks can be dropped at midnight and at 4 AM (for one hour each) and also during the whole weekend (48 hours, all of Saturday and Sunday).
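For anyone who would rather drive such a schedule from a script than from firewall rules, the same windows can be written as a small predicate. This is only a sketch: the function name `in_outage` is made up, and it assumes the schedule is evaluated in the local time of whatever machine applies the deny rule.

```python
from datetime import datetime

OUTAGE_HOURS = {0, 4}    # midnight and 4 AM, one hour each
WEEKEND_DAYS = {5, 6}    # datetime.weekday(): Saturday=5, Sunday=6


def in_outage(now: datetime) -> bool:
    """Return True when the deny rule should be active at `now`."""
    return now.weekday() in WEEKEND_DAYS or now.hour in OUTAGE_HOURS
```

A scheduler or test harness could call `in_outage(datetime.now())` and enable or disable the deny rule accordingly.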
As @osiris suggested, an easy method to simulate a particular resource being unavailable is to add (and later remove) an entry in your hosts file that maps the host to a fake IP. You didn't mention what your test environment is, so I'm guessing it's a Linux-based CI/CD platform.
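A rough sketch of that hosts-file toggle is below. The helper names and the marker comment are invented for the example, and `192.0.2.1` is just an address from the documentation-only TEST-NET-1 range, so nothing should answer there. Writing to the real `/etc/hosts` needs root, and how quickly the change takes effect depends on local resolver caching.

```python
from pathlib import Path

# Marker comment so only entries written by this helper are ever removed.
MARKER = "# added-by-outage-test"


def _without_entry(text, hostname):
    """Return hosts-file text minus our marked entry for `hostname`."""
    kept = [line for line in text.splitlines(keepends=True)
            if not (MARKER in line and hostname in line.split())]
    return "".join(kept)


def block_host(hosts_path, hostname, fake_ip="192.0.2.1"):
    """Point `hostname` at an unreachable address (safe to call twice)."""
    path = Path(hosts_path)
    text = _without_entry(path.read_text(), hostname)
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + f"{fake_ip} {hostname} {MARKER}\n")


def unblock_host(hosts_path, hostname):
    """Remove the entry written by block_host, leaving other lines alone."""
    path = Path(hosts_path)
    path.write_text(_without_entry(path.read_text(), hostname))
```

For the staging API this would be something like `block_host("/etc/hosts", "acme-staging-v02.api.letsencrypt.org")` before the step that should fail, and `unblock_host(...)` with the same arguments afterwards.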
FWIW, nginx will let you test the existence of a file (and other objects) on the operating system during the request, and act appropriately (docs):
if (-f /path/to/semaphore) {} # file exists
if (!-f /path/to/semaphore) {} # file doesn't exist
This is a lightweight check, because it leverages some operating system caching. It's often used to set a "downtime" flag, but I often use it in tests -- you can just return a custom error when the flag is set.
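On the test side, the flag file can be created and removed around the step that should see the failure. A minimal sketch, assuming nginx is already configured with the `-f` check on the same path and returns an error while the file exists; `simulated_outage` is a made-up helper name:

```python
import contextlib
from pathlib import Path


@contextlib.contextmanager
def simulated_outage(flag_path):
    """Create the semaphore file for the duration of the `with` block."""
    flag = Path(flag_path)
    flag.touch()
    try:
        yield flag
    finally:
        # Always remove the flag, even if the test body raised.
        flag.unlink(missing_ok=True)
```

Inside `with simulated_outage("/path/to/semaphore"):` the client code under test would see the error response; as soon as the block exits, requests go through again.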
Another option is to use OpenResty, which is an Nginx-based distribution that integrates server-side scripting with Lua. It would be trivial to create an endpoint to toggle "proxy on/off", and with a bit more work you could simulate custom network conditions. I have several test systems that use this approach.