Why services are needed
There are two main reasons:
- You have multiple independent teams; separating into services lets each team make progress independently, with less alignment, less communication overhead, and fewer blockers.
- One component of the application has very different (usually higher) scaling requirements than the rest, which makes scaling the entire application difficult. Hence we move that part into a separate service.
When services are designed for scale, please note:
- Data is the most critical part of services
- Compute is cheaper than database
- Scaling is constrained by database not code
If you are creating a new service that is not about scaling teams or scaling load, and does not require its own data repository and ownership, stop.
- The database is sacred (actually, the data is).
- Never reach for fixing data as the first solution; it's the last option.
- Always keep your state management system stable.
- Reject a request as soon as possible.
- Any request the SMS (state management system) accepts goes to the database.
- Always instrument your code
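The "reject a request as soon as possible" rule can be sketched as a validation guard that runs before anything touches the database. This is an illustrative sketch, not from the notes; the `TransferRequest` fields and rules are assumed examples.

```python
from dataclasses import dataclass

@dataclass
class TransferRequest:
    amount: int          # minor units, e.g. cents
    currency: str
    beneficiary_id: str

def validate(req: TransferRequest) -> list:
    """Cheap, DB-free checks; only requests that pass may hit the database."""
    errors = []
    if req.amount <= 0:
        errors.append("amount must be positive")
    if req.currency not in {"USD", "EUR", "INR"}:
        errors.append("unsupported currency")
    if not req.beneficiary_id:
        errors.append("beneficiary_id is required")
    return errors

def handle(req: TransferRequest) -> str:
    errors = validate(req)
    if errors:
        return "rejected: " + "; ".join(errors)   # never reached the database
    return "accepted"                             # only now do we persist
```

The point is ordering: every check that needs no I/O runs first, so bad requests are bounced without spending a database round trip.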
What is Scalable
- Being able to handle spikes of incoming requests and respond within defined SLA / SLOs.
- Ability to serve more customers / requests / transactions by adding more hardware. (Horizontal scaling)
Problems with Services
- Multiple modes of failures
- Network issues, timeouts, etc.
- Increased and unpredictable latency (More reading)
- Synchronous calls across services break service isolation and create high coupling.
- Within a service
- High Cohesion
- Across services
- Low coupling
Async Services to Rescue
Why it's needed:
- Service communication will, at some point, fail or slow down (timeouts).
- Services also need time to recover, despite being HA.
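Since calls will slow down, every cross-service call needs a hard bound. A minimal sketch using only the standard library; `call_downstream` is a hypothetical stand-in for a network call:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def call_downstream(delay: float) -> str:
    """Stand-in for a network call to another service."""
    time.sleep(delay)
    return "ok"

def call_with_timeout(delay: float, timeout: float) -> str:
    """Bound the call: a slow dependency degrades to a fallback, not a hang."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call_downstream, delay)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return "fallback"
        # note: the pool still joins the stray worker on shutdown;
        # real HTTP clients cancel at the socket level instead
```

A real circuit breaker (see below) adds state on top of this: after N consecutive timeouts it stops calling entirely for a while.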
Requirements for Async:
- Replayable queues
- Embrace DLQs (dead-letter queues)
- Circuit breakers (Hystrix / Resilience4j)
- Eventual consistency
- Multiple redundancy (cache / db / prepared views / long term storage / data lakes)
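The first two requirements (replayable queues, DLQs) can be sketched as a consumer that retries a message a bounded number of times and then parks it on the DLQ instead of dropping it. In-memory deques stand in for a real broker; `MAX_ATTEMPTS` and the message shape are assumptions:

```python
from collections import deque

MAX_ATTEMPTS = 3

def consume(queue: deque, dlq: deque, handler) -> None:
    """Drain the queue; retry each message up to MAX_ATTEMPTS, then dead-letter it."""
    while queue:
        msg = queue.popleft()
        attempts = msg.get("attempts", 0) + 1
        try:
            handler(msg)
        except Exception:
            if attempts >= MAX_ATTEMPTS:
                dlq.append(msg)          # embrace the DLQ: park it, don't lose it
            else:
                msg["attempts"] = attempts
                queue.append(msg)        # replayable: requeue for another try

# Usage: one good message, one poison message.
q, dlq = deque([{"id": 1}, {"id": 2, "poison": True}]), deque()
processed = []

def handler(msg):
    if msg.get("poison"):
        raise RuntimeError("cannot process")
    processed.append(msg["id"])

consume(q, dlq, handler)
```

The poison message ends up on the DLQ after three attempts, where it can be inspected and replayed later, while the good message is processed normally.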
- Requires Domain Knowledge.
- Design services around domains not functions.
- Obviously don’t copy everything. 🙂
- Excess never works.
- Be pragmatic
- Organize services based on Root aggregates and bounded context.
- Every domain state transition triggers a domain event and other services respond to the event.
- Some duplication is perfectly okay in distributed systems; keeping a copy of immutable data beats making sync network calls.
- More often, systems don't fail because of duplication; they fail because of dependencies on external resources.
- If your service must call another service to perform its most critical function, congratulations: you have designed a distributed monolith.
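The "keep a copy of immutable data" point can be sketched as a local read model kept fresh by consuming domain events, so serving a request needs no sync call to the owning service. The `BeneficiaryCreated` event and its fields are illustrative assumptions:

```python
# Local read model fed by domain events, instead of calling the
# Beneficiary service synchronously on every request.
beneficiary_names = {}

def on_event(event: dict) -> None:
    """Apply a (hypothetical) BeneficiaryCreated event to the local copy."""
    if event["type"] == "BeneficiaryCreated":
        beneficiary_names[event["id"]] = event["name"]

def display_name(beneficiary_id: str) -> str:
    """Serve from the local copy; stale-but-available beats down-with-the-dependency."""
    return beneficiary_names.get(beneficiary_id, "unknown")

on_event({"type": "BeneficiaryCreated", "id": "b1", "name": "Alice"})
```

If the Beneficiary service is down, reads here still work; the copy is eventually consistent, which is exactly the trade the notes argue for.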
There are two types of services that can be built:
- Core services
- These are your domains. Ex: Transfer / Beneficiary …
- These services have state and work as state machines.
- Should only be responsible for three things:
- Validate incoming request (Balance / business rules)
- Persist request
- Manage and update state
- All side effects of persistence that can be eventually consistent should be moved out of the transaction and processed later using retryable queues
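The three responsibilities can be sketched as a tiny state machine: validate the transition, apply it, and emit a domain event per transition. The `Transfer` states, actions, and event names here are illustrative assumptions, not from the notes:

```python
# Minimal state machine for a hypothetical Transfer aggregate.
TRANSITIONS = {
    ("created", "validate"): "validated",
    ("validated", "execute"): "completed",
    ("validated", "reject"): "failed",
}

events = []   # stand-in for the domain event stream

def apply(state: str, action: str) -> str:
    """Reject illegal transitions; every legal transition emits a domain event."""
    nxt = TRANSITIONS.get((state, action))
    if nxt is None:
        raise ValueError("illegal transition: " + state + " -> " + action)
    events.append("Transfer" + nxt.capitalize())
    return nxt

state = apply("created", "validate")
state = apply(state, "execute")
```

Because every transition goes through one table of allowed moves, there is exactly one place to audit, and downstream services react to the emitted events rather than being called inline.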
- Actor services
- Actors work on messages; they can either:
- Process an incoming message (SMS / email)
- Or convert a message into one or more different messages and send them (splitting a payment request into a debit and a credit)
- These are stateless actors; they should not manage state
- They may persist messages for idempotency or internal aggregation, but should not have their own state
Actors are purely scalable components: if we need to process more messages, we can always add more workers (actor instances). Since they are stateless, it is easy to scale up, or to scale down once the backlog is cleared.
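The message-converting actor above can be sketched as a pure function: one payment message in, a debit and a credit message out, with no state held between calls. The message shape is an assumed example:

```python
def split_payment(msg: dict) -> list:
    """Stateless actor: one PaymentRequested message in, two messages out."""
    return [
        {"type": "Debit",  "account": msg["from"], "amount": msg["amount"]},
        {"type": "Credit", "account": msg["to"],   "amount": msg["amount"]},
    ]

out = split_payment({"type": "PaymentRequested", "from": "a", "to": "b", "amount": 100})
```

Because the actor holds nothing between messages, running ten copies behind the queue is exactly as correct as running one, which is what makes this the easy half of the system to scale.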
To achieve the above design we need two components:
- Reliable service bus
- Downstream Actors (services)
There is an additional challenge here: keeping the source of truth and the source of events in sync. That is:
- An event must be emitted once the transaction succeeds
- An event must not be emitted if the transaction to the DB fails
If this is not done, you end up with a spaghetti design: millions of if checks in code and impossible-to-manage state transitions.
For services which emit domain events, this is usually attempted with a goroutine / Sidekiq job / thread, with retries and graceful failure handling. That works for most cases, but there is no way to recover if the service crashes between the commit and the emit (OOM, segfaults from the platform, and tons of other issues).
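One standard remedy, not named in these notes, is the transactional outbox pattern: write the event into an outbox table inside the same DB transaction as the state change, and let a separate relay publish and mark outbox rows. A crash can then never separate the commit from the event. A sketch with sqlite3 standing in for the real database and the relay's publish being a local list:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transfers (id TEXT PRIMARY KEY, state TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, published INTEGER DEFAULT 0)")

def complete_transfer(transfer_id: str) -> None:
    """State change and event row commit (or roll back) atomically."""
    with db:  # one transaction: both writes happen, or neither does
        db.execute("INSERT INTO transfers VALUES (?, 'completed')", (transfer_id,))
        db.execute("INSERT INTO outbox (event) VALUES (?)",
                   ("TransferCompleted:" + transfer_id,))

def relay() -> list:
    """Separate process in real life: publish unpublished rows, then mark them."""
    rows = db.execute("SELECT id, event FROM outbox WHERE published = 0").fetchall()
    published = []
    for row_id, event in rows:
        published.append(event)  # stand-in for a publish to the service bus
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return published

complete_transfer("t1")
```

The relay delivers at-least-once (a crash between publish and mark causes a re-publish), which is why consumers must be idempotent, as the actor section above already requires.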
Keep the number of paths that update state as small as possible 🙂
If you want infinite scale, make state management someone else’s problem.