Architecture
The architecture of Allied Telesis (AT) Cloud Services is designed to support rapid development, scalability, and maintainability through a set of guiding principles and infrastructure choices. Central to this architecture is the adoption of a loosely-coupled and stateless service model, enabling teams to work independently and deploy changes with minimal system-wide impact. Services are sized according to a 'Goldilocks' principle: neither so small that they create overhead nor so large that they hinder testing and comprehension.
Key architectural rules of thumb promote clarity, resilience, and clean boundaries between components, ensuring that systems are easy to understand, debug, and evolve. Infrastructure components such as externally-managed databases, authentication via Single Sign-On, and secure inter-process communication are foundational to the platform.
Deployment strategies accommodate independent lifecycles for different products, and the architecture remains a living document, evolving through a collaborative process involving engineering teams and a cross-regional steering committee. This ensures consistency, adaptability, and alignment with organizational goals.
Key Principles Underpinning AT Cloud Architecture
A loosely-coupled architecture
"Loosely coupled applications allow teams to develop features, deploy, and scale independently, which allows organizations to iterate quickly on individual components. Application development is faster and teams can be structured around their competency, focusing on their specific application." - CNCF
The inverse, a tightly coupled architecture, makes it difficult to deploy changes without affecting many parts of the system.
"With a tightly coupled architecture, small changes can result in large-scale, cascading failures. As a result, anyone working in one part of the system must constantly coordinate with anyone else working in another part of the system, including navigating complex and bureaucratic change management processes." - DORA
Coupling should occur along natural seams, with a minimal set of APIs that are both stable and understandable.
To increase our ability to develop and deploy our code as quickly as possible, we choose to have a loosely-coupled architecture.
Stateless services are preferred
Stateless services do not keep a record in memory of earlier requests that would influence how later requests are handled. This dramatically simplifies testing, upgrades, and scaling, as we do not need to 'initialise' a process or do any extra work to get it ready after it has started.
As automatic scaling is important for our platform, we choose to prefer stateless services whenever possible.
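The following is a minimal sketch of what "stateless" means in practice, not a prescribed implementation. The names `ExternalStore` and `CartService` and the shopping-cart domain are hypothetical, and the store is an in-memory stand-in for an externally managed database. The point it illustrates is that any replica can serve any request, because no per-user data lives in the service process itself.

```python
from typing import Dict, List


class ExternalStore:
    """Hypothetical stand-in for an externally managed database (in-memory for this demo)."""

    def __init__(self) -> None:
        self._carts: Dict[str, List[str]] = {}

    def load_cart(self, user_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the stored data by accident.
        return list(self._carts.get(user_id, []))

    def save_cart(self, user_id: str, items: List[str]) -> None:
        self._carts[user_id] = list(items)


class CartService:
    """A stateless replica: it keeps a handle to the store but no per-user data."""

    def __init__(self, store: ExternalStore) -> None:
        self._store = store

    def add_item(self, user_id: str, item: str) -> List[str]:
        # Everything needed to handle the request is loaded from the store,
        # then written back; nothing is remembered between calls.
        items = self._store.load_cart(user_id)
        items.append(item)
        self._store.save_cart(user_id, items)
        return items


store = ExternalStore()
replica_a = CartService(store)
replica_b = CartService(store)  # e.g. a second replica started by the autoscaler

replica_a.add_item("user-1", "switch")
cart = replica_b.add_item("user-1", "router")
print(cart)  # ['switch', 'router']
```

Because `replica_b` needs no warm-up or hand-over from `replica_a`, replicas can be added, removed, or restarted freely.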
'Goldilocks' service size
A common criticism of a microservices-based architecture is that it can be taken too far, with developers creating services that are little more than a single file of code. A large number of small services creates a large amount of overhead in understanding the system, slowing developer productivity.
Similarly, services that are too large incur a large build-time overhead, as it takes more time to test a large service than a small one.
Some indications that a service is too small:
- Used by only one other service
- More indirection added than value
- Tight coupling to another service
Some indications that a service is too large:
- Cannot effectively use Test-Driven Development (the cycle time to run all tests exceeds the amount of time a developer can reasonably spend waiting before making the next change)
- Team does not have confidence that running all acceptance tests is enough to verify the service is meeting its requirements
To ensure we can develop quickly, we choose to develop services that are 'just right' in terms of size: not too big and not too small. Such services are easy for a developer to modify and comprehend without impacting the comprehension of the system as a whole.
Rules of Thumb
When considering how to design a component for the system, some useful rules of thumb are:
- It should be possible to completely describe a container's functionality and responsibilities in a paragraph of text
- It should be easy to draw how the data flows between components on a whiteboard
- Put containers that must always be deployed, scaled, and failed together inside one pod (e.g. as a main container and a sidecar helper)
- If two processes can't tolerate being restarted or scaled independently, they probably belong in the same pod
- Communicate across pods only at natural system boundaries, where you'd expect to have an API, a queue, or a contract. Don't split things more finely than that
- Don't build fragile, chatty protocols between pods - the seams between them should be coarse-grained. If you find two pods constantly needing to know each other's internals, you probably split them incorrectly
- If debugging a failure requires digging across 5 pods with no clear boundary, your seams aren't clean
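The rule about chatty protocols can be illustrated with a small sketch. Everything here is hypothetical (the `DeviceService` class, its methods, and the `DeviceSummary` type are invented for illustration), and each method call stands in for one network round trip to another pod. It contrasts a fine-grained interface, where the caller must assemble a result from several calls, with a coarse-grained one that returns the whole result at once.

```python
from dataclasses import dataclass


@dataclass
class DeviceSummary:
    name: str
    firmware: str
    online: bool


class DeviceService:
    """Hypothetical service in another pod; each method call models one round trip."""

    def __init__(self) -> None:
        self.calls = 0
        self._devices = {"dev-1": DeviceSummary("edge-switch", "1.2.3", True)}

    # Chatty, fine-grained interface: the caller needs three round trips
    # and has to know how the service structures its data.
    def get_name(self, device_id: str) -> str:
        self.calls += 1
        return self._devices[device_id].name

    def get_firmware(self, device_id: str) -> str:
        self.calls += 1
        return self._devices[device_id].firmware

    def is_online(self, device_id: str) -> bool:
        self.calls += 1
        return self._devices[device_id].online

    # Coarse-grained interface: one round trip, one stable contract.
    def get_summary(self, device_id: str) -> DeviceSummary:
        self.calls += 1
        return self._devices[device_id]


chatty = DeviceService()
summary_a = DeviceSummary(
    chatty.get_name("dev-1"),
    chatty.get_firmware("dev-1"),
    chatty.is_online("dev-1"),
)

coarse = DeviceService()
summary_b = coarse.get_summary("dev-1")

print(chatty.calls, coarse.calls)  # 3 1
```

Both callers end up with the same data, but the coarse-grained seam has fewer failure points and hides the service's internals from its callers.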
Common Infrastructure Components
What we would consider the "business" processes cannot function alone. They need infrastructure components to be able to do their jobs. This section details the infrastructure and describes how the business processes can interact with them.
Database
Databases are managed externally. Both cloud and database vendors provide services that we can use, for example MongoDB Atlas and Amazon RDS. Both of these examples are likely to be used by the system for specific purposes.
Having externally-managed databases means Allied Telesis does not have to manage database backups and security updates. Managed cloud offerings often provide more automated monitoring than the free community editions of the same databases, which our platform team can use to monitor performance more easily.
Authentication
The user cannot be expected to log in to multiple images. Instead, we will share a common authenticator; once users are logged in, all images will be able to contact the authenticator to verify the user.
To achieve this, we will implement Single Sign-On using a common bearer token. A simple implementation uses a web server to ensure all users have a bearer token and to reverse-proxy web content from the individual pods. Pods can contact the authentication application to convert the bearer token into user details, which they are expected to cache for an appropriate amount of time.
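As a rough sketch of the caching behaviour expected of a pod, the example below wraps a token lookup in a small time-limited cache. It is an illustration under stated assumptions, not the platform's actual API: `TokenCache`, `fake_authenticator`, the 60-second lifetime, and the shape of the user details are all hypothetical, and a real pod would call the authentication application over the network instead of a local function.

```python
import time
from typing import Callable, Dict, Optional, Tuple

UserDetails = Dict[str, str]


class TokenCache:
    """Caches bearer-token lookups for a fixed time-to-live (hypothetical helper)."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[UserDetails]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup      # stands in for a call to the authenticator
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, UserDetails]] = {}

    def user_for(self, token: str) -> Optional[UserDetails]:
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None and cached[0] > now:
            return cached[1]       # still fresh: no call to the authenticator
        details = self._lookup(token)
        if details is None:
            # Unknown or revoked token: drop any stale entry, cache nothing.
            self._entries.pop(token, None)
            return None
        self._entries[token] = (now + self._ttl, details)
        return details


lookups = []


def fake_authenticator(token: str) -> Optional[UserDetails]:
    # Assumption for the demo: only "valid-token" maps to a user.
    lookups.append(token)
    return {"username": "alice"} if token == "valid-token" else None


now = [0.0]
cache = TokenCache(fake_authenticator, ttl_seconds=60, clock=lambda: now[0])

cache.user_for("valid-token")  # cache miss: authenticator is contacted
cache.user_for("valid-token")  # cache hit: served locally
now[0] = 61.0
cache.user_for("valid-token")  # entry expired: authenticator is contacted again
print(len(lookups))  # 2
```

The time-to-live is a trade-off: a longer value reduces load on the authenticator, while a shorter one limits how long a revoked token keeps working.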
Inter-Process Communication
Containers must be able to communicate securely with other containers. For HTTPS communication, this means mutual TLS (mTLS). A service mesh will be provided that allows containers to communicate securely with other containers without any extra code on their part. The service mesh will handle creating secure wrappers that the containers will use transparently.
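To make "mutual TLS" concrete, the sketch below shows roughly what the server side of an mTLS connection has to configure, using Python's standard `ssl` module. It only illustrates the work the service mesh takes off the containers' hands; it is not code that services are expected to write. The function names and the file-path parameters are hypothetical.

```python
import ssl


def new_mtls_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that refuses clients without a certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This is the "mutual" part: the server also verifies the client's identity.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_identity(ctx: ssl.SSLContext, cert_file: str, key_file: str, ca_file: str) -> None:
    """Load this service's own certificate and the CA used to verify peers.

    The paths are placeholders; in a mesh, a sidecar would typically manage
    and rotate these credentials.
    """
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)


ctx = new_mtls_server_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```

With a service mesh in place, certificate loading, rotation, and verification of this kind happen outside the application container.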
Deployment
Allied Telesis sells a number of unconnected products to customers. For example, OneConnect and Auger are both hosted in the cloud, but do not interact with each other. These products need not share the same development and deployment lifecycle.
See "Continuous Delivery of AT Cloud Services" for some recommendations on how to achieve a sustainable deployment model.
Living Architecture
This document and the others that accompany it do not live in isolation. They are updated whenever the engineering teams find that changes are required. The process begins with a single developer with an idea; they discuss it with their team, which discusses it with the wider development center, which in turn brings it to the steering committee.
The steering committee is formed of one representative from each center, e.g. NZ, US, JP. The committee reviews proposals brought to it, and can provide feedback or request changes before they are adopted. In this way we ensure that the common architecture is a collaborative effort, fit for all our purposes.
Although individual teams have autonomy over their images, they are expected to use the common architectural components and to bring a proposal before the committee when those components need to change.