3.4. Infrastructure

The platform was designed and developed for AWS infrastructure, which remains the first choice for most initiatives. Other options, including Microsoft Azure and private infrastructure, are also possible with some additional effort.

The Open edX platform has matured over 10 years, powering online learning initiatives at scales ranging from a few hundred users to millions of registered users. Since 2021, the open-source community has been transitioning to Kubernetes (k8s), for which several hosting options currently exist:

  • EKS (Amazon Elastic Kubernetes Service): A managed Kubernetes service provided by Amazon Web Services (AWS).

  • AKS (Azure Kubernetes Service): A managed Kubernetes service offered by Microsoft Azure.

  • Rancher: An open-source platform that provides a management UI for multiple Kubernetes clusters, regardless of where they are hosted. Rancher can host Kubernetes clusters in any data center or manage clusters hosted on any cloud provider.

3.4.1. Solution Architecture at Scale

Scaling up is one of the things the platform does best. A k8s installation can start exceedingly small and grow according to the usage needs. However, a larger scale comes with more significant complexity and added costs, so an initiative usually starts small and grows the platform architecture as the initiative’s needs grow.

3.4.1.1. Small Scale:

When hosting a small-scale Kubernetes cluster, using just one machine is not ideal, as this can lead to a lack of high availability and data loss in case of a machine failure. Instead, having a cluster of at least three machines for handling real user traffic is recommended. Having a separate machine for the database service is also a good idea since keeping databases within the Kubernetes cluster is generally not recommended due to their special storage needs, complex scaling requirements, and potential performance issues.

In addition to the database and worker machines, there is also a controller for managing incoming traffic, usually a load balancer service provided by AWS or Azure. An additional machine is required if you are using a self-hosted cluster through Rancher.

Based on the above, even a small initiative will need at least six virtual machines to properly run the Open edX platform in a production environment.

Traffic Capacity:

100~200 simultaneous users (concurrent users).

While every initiative is different, as a rule of thumb, you can expect less than 10% of the total student population to be concurrent users at a particular time. This scale will apply to a total student population of around 1000-3000 learners.
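The rule of thumb above can be turned into a quick sizing calculation. The following is a minimal sketch, not part of the platform: the function name `estimate_peak_concurrent_users` and its `concurrency_percent` parameter are hypothetical, and the 10% default is treated as an upper bound taken from the text.

```python
import math


def estimate_peak_concurrent_users(total_learners: int, concurrency_percent: int = 10) -> int:
    """Return an upper-bound estimate of concurrent users.

    Assumption: at most `concurrency_percent` percent of the total student
    population is online at the same time (10% is the rule of thumb above).
    """
    if total_learners < 0:
        raise ValueError("total_learners must be non-negative")
    if not 0 <= concurrency_percent <= 100:
        raise ValueError("concurrency_percent must be between 0 and 100")
    # Round up so the estimate never undershoots the rule of thumb.
    return math.ceil(total_learners * concurrency_percent / 100)


if __name__ == "__main__":
    for learners in (1000, 3000):
        print(learners, "learners ->", estimate_peak_concurrent_users(learners), "concurrent users at most")
```

Because the text says "less than 10%", the result is best read as a ceiling for planning purposes rather than a prediction of actual traffic.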

For example:

Small-scale architecture model image example.

3.4.1.2. Medium Scale:

More extensive amounts of traffic will require the infrastructure to follow suit. For the different layers, this means various interventions.

The database server becomes two different clusters: MySQL moves to a primary-replica architecture, and MongoDB becomes a replica set with three nodes. The k8s cluster grows from five nodes to around thirty. This is naturally only done in response to traffic, and when available, we use auto-scaling groups or other horizontal scaling technologies to keep costs under control. The ingress controller also grows and becomes a candidate for a high-availability configuration.

At this scale, the cost of hosting user-generated files, backups, tracking logs, and operational logs stops being negligible and should be considered as well.

Traffic Capacity:

1000~2000 simultaneous users (concurrent users).

While every initiative is different, as a rule of thumb, you can expect less than 10% of the total student population to be concurrent users at a particular time. This scale will apply to a total student population of around 10k-30k learners.

Example:

Medium scale architecture model image example.

3.4.1.3. Large Scale:

Requirements at this scale are less of a recipe and more a response to the initiative needs with a data-driven approach. We can still give some ballpark estimates of what it takes to run the platform at this level.

The database becomes the biggest bottleneck and needs to be turned into a multi-master configuration. Standard servers for the data layer will also double in size. Moving to a dedicated MySQL hosting service is an option, albeit a more expensive one. MongoDB may still hold up at this scale, but if it does not, two more nodes can be added. Auxiliary services such as Redis and Elasticsearch become their own layer and should be hosted independently.

The application cluster must scale dynamically to optimize infrastructure cost. It always runs at a minimum of ten nodes, but it might reach one hundred nodes of the same size at peak capacity.
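The scaling bounds described above can be illustrated with a small calculation. This is only a sketch under stated assumptions: the ten-node floor and one-hundred-node ceiling come from the text, while `desired_node_count` and the `users_per_node` figure of 100 are hypothetical values chosen for illustration and should be replaced with numbers measured on a real installation. In practice, a cluster autoscaler makes this decision from resource metrics rather than from a user count.

```python
import math

MIN_NODES = 10    # from the text: the cluster always runs at least ten nodes
MAX_NODES = 100   # from the text: it might reach one hundred nodes at peak


def desired_node_count(concurrent_users: int, users_per_node: int = 100) -> int:
    """Clamp the number of application nodes to the documented bounds.

    `users_per_node` is a hypothetical capacity figure, not a measured one.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    if users_per_node <= 0:
        raise ValueError("users_per_node must be positive")
    needed = math.ceil(concurrent_users / users_per_node)
    # Never drop below the floor, never exceed the ceiling.
    return max(MIN_NODES, min(MAX_NODES, needed))


if __name__ == "__main__":
    for users in (0, 5000, 10000):
        print(users, "concurrent users ->", desired_node_count(users), "nodes")
```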

Traffic Capacity:

5000~10000 simultaneous users (concurrent users).

While every initiative is different, as a rule of thumb, you can expect less than 10% of the total student population to be concurrent users at a particular time. This scale will apply to a total student population of around 50k-100k learners.
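The traffic capacities quoted for each scale can be combined into a simple lookup. The sketch below is illustrative only: `scale_tier` is a hypothetical helper, and because the documented ranges (100~200, 1000~2000, 5000~10000 concurrent users) leave gaps between them, it assumes that any value falling in a gap belongs to the next larger tier.

```python
# Upper bounds of concurrent users per tier, taken from the ranges in this section.
SCALE_TIERS = [
    ("small", 200),
    ("medium", 2000),
    ("large", 10000),
]


def scale_tier(concurrent_users: int) -> str:
    """Map a concurrent-user count to the smallest tier that can hold it.

    Assumption: values between the documented ranges (for example 500)
    are assigned to the next larger tier.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    for name, upper_bound in SCALE_TIERS:
        if concurrent_users <= upper_bound:
            return name
    return "even larger"


if __name__ == "__main__":
    for users in (150, 1500, 8000, 20000):
        print(users, "concurrent users ->", scale_tier(users))
```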

For example:

Large scale architecture model image example.

3.4.1.4. Even Larger Scale:

This scale applies when the initiative has grown to an immense size. To reach this traffic volume, the platform architecture will need to be adjusted to better suit your specific case. This means the initiative needs a dedicated team of developers and engineers to maintain the platform.