kkapelon 1 hour ago
> Your circuit breakers don't talk to each other.

Doesn't this suffer from the opposite problem though? There is a very brief hiccup for Stripe and instance 7 triggers the circuit breaker. Then all other services stop trying to contact Stripe even though Stripe has recovered in the meantime. Or am I missing something about how your platform works?

rodrigorcs 1 hour ago
Good question; that's exactly why the trip decision isn't based on a single instance seeing a few errors. Openfuse aggregates failure metrics across the fleet before making a decision.

So instance 7 seeing a brief hiccup doesn't trip anything on its own; the breaker only opens when the collective signal crosses your threshold (e.g., a 40% failure rate across all instances in a 30s window). A momentary blip from one instance doesn't affect the others.

And when it does trip, the half-open state sends controlled probe requests to test recovery, so if Stripe bounces back quickly, the breaker closes again automatically.
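
To make that concrete, here's roughly what the configuration looks like in TypeScript. This is a sketch; the option and method names are simplified for illustration, not the exact SDK surface:

    // Sketch only: option/method names simplified, not the exact SDK API.
    import { OpenfuseCloud } from "openfuse"; // import path illustrative

    const openfuse = new OpenfuseCloud({ apiKey: process.env.OPENFUSE_API_KEY });

    const stripeBreaker = openfuse.breaker("stripe-api", {
      failureRateThreshold: 0.4,   // trip at a 40% failure rate...
      evaluationWindowSeconds: 30, // ...measured across all instances in a 30s window
      halfOpenProbes: 5,           // after a trip, a few probe requests test recovery
    });

    // The wrapped call is allowed or blocked based on local in-memory breaker state.
    const res = await stripeBreaker.execute(() =>
      fetch("https://api.stripe.com/v1/charges", { method: "POST" /* ... */ })
    );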

netik 2 hours ago
This is a great idea, but it's a great idea when on-prem.

During some thread, somewhere, there's going to be a roundtrip time between my servers and yours, and once I am at a scale where this sort of thing matters, I'm going to want this on-prem.

What's the difference between this and checking against a local cache before firing the request and marking the service down in said local cache so my other systems can see it?

I'm also concerned about a false positive or a single system throwing an error. If it's a false positive, then the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working vs none when money is in play.

You also state that "The SDK keeps a local cache of breaker state" -- If I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's in a local cache like redis or memcache, I'm better off using my own network for "sub microsecond response" vs the time to go over the wire to talk to your service.

I've fought huge cascading issues in production at very large social media companies. It takes a bit more than breakers to solve these problems. Backpressure is a critical component of this, and often turning things off completely isn't the best approach.

rodrigorcs 1 hour ago
I agree with more of this than you might expect.

On-prem: You're right, and it's on the roadmap. For teams at the scale you're describing, a hosted control plane doesn't make sense. The architecture is designed to be deployable as a self-hosted service; the SDK doesn't care where the control plane lives, just that it can reach it (you can swap the OpenfuseCloud class for the plain Openfuse one, pointed at your own URL).
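
Concretely, the swap is just the constructor (import path and option names illustrative):

    import { OpenfuseCloud, Openfuse } from "openfuse"; // import path illustrative

    // Hosted control plane:
    const hosted = new OpenfuseCloud({ apiKey: process.env.OPENFUSE_API_KEY });

    // Self-hosted control plane on your own network: same SDK, your own URL.
    const selfHosted = new Openfuse({ baseUrl: "https://openfuse.internal.example.com" });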

Roundtrip time: The SDK never sits in the hot path of your actual request, and it doesn't check our service before firing each call. It keeps a local cache of the current breaker state and evaluates locally; the decision to allow or block a request is pure local memory, not a network hop. The control plane pushes state updates asynchronously, so your request latency isn't affected. The propagation delay is how quickly a state change reaches all instances, not how long each request waits.

False positives / single-system errors: This is exactly why aggregation matters. Openfuse doesn't trip because one instance saw one error. It aggregates failure metrics across the fleet, and you set thresholds on the collective signal (e.g., a 40% failure rate across all instances in a 30s window). A single server throwing an error doesn't move that needle. The thresholds and evaluation windows are configurable precisely for this reason.
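
As a rough model of that evaluation (simplified for illustration, not the actual evaluator code):

    // Simplified model of the fleet-wide trip decision, not the real evaluator.
    interface InstanceSample {
      instanceId: string;
      successes: number;
      failures: number; // counts reported for the current evaluation window
    }

    function shouldTrip(samples: InstanceSample[], threshold = 0.4): boolean {
      const total = samples.reduce((n, s) => n + s.successes + s.failures, 0);
      const failures = samples.reduce((n, s) => n + s.failures, 0);
      if (total === 0) return false;        // no traffic, no decision
      return failures / total >= threshold; // trip only on the collective signal
    }

    // One unhealthy instance in an otherwise healthy fleet stays far below 40%:
    shouldTrip([
      { instanceId: "i-7", successes: 0, failures: 20 },
      { instanceId: "i-1", successes: 500, failures: 0 },
      { instanceId: "i-2", successes: 480, failures: 2 },
    ]); // => false (22 failures out of 1002 calls, ~2%)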

Local cache location: It's in-process memory, not Redis or Memcache. Each SDK instance holds the last known breaker state in memory, and the control plane pushes updates to connected SDKs. So the per-request check is: read a boolean from local memory. The network only comes into play when state changes propagate, not on every call. The cache for 100 breakers is ~57KB; for 1,000 breakers, which is quite extreme, it's ~393KB.
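
A rough mental model of how the hot path and the sync path stay separate (simplified, not the SDK internals):

    // Simplified: the per-request check is a read from process memory.
    type BreakerState = "closed" | "open" | "half-open";

    const localState = new Map<string, BreakerState>(); // last known state per breaker

    // Sync path: the control plane pushes state changes asynchronously;
    // the handler only updates local memory.
    function onStatePush(breakerId: string, state: BreakerState): void {
      localState.set(breakerId, state);
    }

    // Hot path: a pure in-memory lookup, no network hop per request.
    function isAllowed(breakerId: string): boolean {
      return localState.get(breakerId) !== "open";
    }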

Backpressure: 100% agree, breakers alone don't solve cascading failures. They're one layer. Openfuse is specifically tackling the coordination and visibility gap in that layer, not claiming to replace load shedding, rate limiting, retry budgets, or backpressure strategies. Those are complementary. The question I'm trying to answer is narrower: when you do have breakers, why is every instance making that decision independently? Why do you have no control over what's going on? Why do you need to make a code change to temporarily disconnect your server from a dependency? And why, if you have 20 services, do you have to configure it 20 times (once per repo)?

Would love to hear more about what you've seen work at scale for the backpressure side. That would be a next step :)

cluckindan 1 hour ago
Instead of paying for a SaaS, a team can autoprogram an on-prem clone for less.

rodrigorcs 1 hour ago
Totally possible, and some teams do. You need a state store, an evaluator job, a propagation layer to push state changes to every instance, an SDK, a dashboard, alerting, audit logging, RBAC, and a fallback strategy for when the coordination layer itself goes down.

None of it is complex individually, but it takes time, and it's the ongoing maintenance that gets you. Openfuse is a bet that most teams would rather pay $99/mo than maintain that.

That said, a self-hosted option is on the near-term roadmap for teams that need it.

dsl 3 hours ago
Now I have seen it all... a SaaS solution for making local outages global.

rodrigorcs 1 hour ago
It makes the awareness global, so instances stop independently hammering a service that the rest of the fleet already knows is down. You can always override manually too, and the change propagates to all servers in <15s.

whalesalad 4 hours ago
const openfuse = new OpenfuseCloud(...);

What happens when your service goes down?

kkapelon 1 hour ago
It is answered in the FAQ at the bottom of the page:

"The SDK is fail-open by design. If our service is unreachable, it falls back to the last known breaker state.

If no state has ever been cached (e.g., a cold start with no connectivity), it defaults to closed, meaning your protected calls keep executing normally. Your app is never blocked by Openfuse unavailability."

rodrigorcs 1 hour ago
Yup, that's true for both Cloud and Self-hosted: it never blocks an execution for any external reason other than the breaker being KNOWN to be open. The state sync and the hot path are two completely separate flows.
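
The fallback order is essentially this (simplified sketch, not the exact SDK internals):

    // Simplified sketch of the fail-open behavior described in the FAQ.
    type BreakerState = "closed" | "open" | "half-open";

    function effectiveState(
      pushedState?: BreakerState, // latest update from the control plane
      cachedState?: BreakerState  // last known state held in process memory
    ): BreakerState {
      if (pushedState !== undefined) return pushedState; // control plane reachable
      if (cachedState !== undefined) return cachedState; // unreachable: use last known state
      return "closed"; // cold start, nothing cached: fail open, calls keep executing
    }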

rodrigorcs 1 hour ago
Feel free to take a look at the SDK code if you want to, it's open :) https://github.com/openfuseio/openfuse-sdk-node