Mukuru & Deimos: AWS Case Study

CASE STUDY: AWS Infrastructure Modernisation & Resilience Enhancement

Yekeen Ajeigbe, Head of Engineering

Publish Date: 14/3/25

Overview

Mukuru partnered with Deimos to modernise its AWS infrastructure, enhancing resilience, cost efficiency, and high availability. Key improvements included automated disaster recovery, optimised cloud spend, EKS auto-scaling, and GitOps-driven deployments. This initiative strengthens Mukuru’s financial services with scalable, secure cloud operations.

The Challenge

Mukuru’s AWS infrastructure, which supports business-critical services, required modernisation to:

Optimise cloud resource utilisation
Enhance disaster recovery capabilities
Improve overall system performance

Without these improvements, Mukuru risked system downtime, inefficient resource allocation, and difficulty scaling under varying workloads.

To address these challenges, Mukuru engaged Deimos to lead a comprehensive AWS infrastructure modernisation initiative aimed at improving disaster recovery, optimising AWS cloud costs, ensuring high availability for applications, and automating DevOps processes.

The key objectives were:

Disaster Recovery (DR) Strategy: Enhancing multi-region readiness, implementing a comprehensive DR plan, and automating infrastructure provisioning using Terraform to ensure rapid failover during major incidents.
AWS Cost & Usage Optimisation: Identifying cost-saving opportunities through cloud resource analysis and optimising storage and resource allocation with Terraform improvements.
Application Performance & High Availability: Improving performance through better caching strategies, high availability setups for observability tools, and optimising EKS auto-scaling and node group configurations.
Infrastructure as Code & Automation: Introducing Infrastructure as Code (IaC) practices with Terraform to automate the management of cloud resources, detect and manage drift, and ensure consistent environments across development, QA, staging, and production.

The Solution

1. Disaster Recovery (DR) Strategy & Implementation

Multi-Region Readiness: Conducted DR feasibility assessments with cost-benefit analysis of different multi-region strategies.
Strategic DR Planning & Testing: Developed a disaster recovery roadmap, focusing on infrastructure resilience and failover strategies.
Infrastructure as Code (IaC) for DR: Automated DR infrastructure provisioning using Terraform.

2. AWS Cost & Usage Optimisation

Cloud Cost Analysis & Optimisation: Used CloudWatch for AWS cost assessments, identifying cost-saving opportunities and optimising resource allocation.
Efficient ECR Management: Designed a multi-threaded cleanup process for ECR images to reduce unnecessary storage costs.
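
The case study does not publish the cleanup tool itself, so the following is only a minimal Python sketch of how a multi-threaded ECR cleanup might be structured. The retention rule (keep the newest 10 images, expire anything older than 30 days), the function names, and the delete_batch callable are all assumptions made for illustration. In a real setup delete_batch might wrap boto3's ecr.batch_delete_image, which accepts at most 100 image IDs per call.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone

BATCH_SIZE = 100  # ECR's BatchDeleteImage accepts at most 100 image IDs per call


def select_expired(images, keep_latest=10, max_age_days=30, now=None):
    """Return digests of images that are outside the newest `keep_latest`
    AND older than `max_age_days` (hypothetical retention rule)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(images, key=lambda i: i["pushed_at"], reverse=True)
    return [i["digest"] for i in newest_first[keep_latest:] if i["pushed_at"] < cutoff]


def delete_in_parallel(digests, delete_batch, workers=8):
    """Split digests into batches and delete them on a thread pool.

    `delete_batch` is a hypothetical callable that takes a list of digests.
    Returns the number of digests submitted for deletion.
    """
    batches = [digests[i:i + BATCH_SIZE] for i in range(0, len(digests), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every batch and re-raises any worker exception
        list(pool.map(delete_batch, batches))
    return len(digests)
```

Threads suit this job because the work is dominated by waiting on API calls rather than CPU, so several batches can be in flight at once.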

3. Application Performance & High Availability

Memcached for Loki: Improved Loki performance with caching to speed up query times.
Prometheus High Availability (HA): Deployed Thanos in production for HA Prometheus monitoring.
EKS Auto-Scaling & Node Group Optimisation: Implemented auto-scaling improvements for Mukuru’s EKS clusters.
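
To illustrate why a cache in front of a log store speeds up repeated queries, here is a small cache-aside sketch in Python. It is not Loki's implementation: Loki is pointed at Memcached through its own configuration. The TTLCache class is an in-memory stand-in for a Memcached client, and cached_query, run_query, and the 300-second TTL are hypothetical names and values.

```python
import time


class TTLCache:
    """In-memory stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def cached_query(cache, run_query, query, start, end, ttl_seconds=300):
    """Cache-aside: serve from the cache when possible, else query and store."""
    key = f"{query}|{start}|{end}"
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = run_query(query, start, end)  # the slow path to the backing store
    cache.set(key, result, ttl_seconds)
    return result
```

A dashboard that refreshes the same panel every few seconds would reach the backing store only once per TTL window under this pattern.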

4. Terraform & DevOps Automation

Terraform Drift Detection: Automated infrastructure drift detection to improve consistency.
Continuous Integration & GitOps: Used GitLab CI for automated infrastructure testing and Argo CD for Kubernetes GitOps.
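
One common way to automate drift detection, shown here as a sketch rather than as the pipeline Mukuru actually runs, is a scheduled CI job that calls terraform plan with the -detailed-exitcode flag. Terraform then exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The function name check_drift and the status labels are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def check_drift(workdir, run=subprocess.run):
    """Run a non-interactive plan in `workdir` and classify the outcome.

    `run` is injectable so the function can be exercised without Terraform.
    """
    proc = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false",
         "-lock=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    # Any unexpected exit code is treated as an error
    return PLAN_STATUS.get(proc.returncode, "error")
```

Run against the main branch on a schedule, a "drift" result suggests the live infrastructure no longer matches the committed code, and the job can raise an alert or open a ticket.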

5. Security & Governance

Multi-Account AWS Structure: Managed via AWS Organizations for centralised security and cost control.
Identity & Access Management (IAM): Enforced Azure AD-based single sign-on (SSO) with role-based access.
Encryption Policy: Implemented encryption for data at rest (AWS KMS) and in transit (TLS 1.2+).
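
On the in-transit side, a TLS 1.2+ policy can also be enforced by clients. The snippet below is a minimal sketch using Python's standard ssl module. The function name make_tls_context is hypothetical, and nothing here is taken from Mukuru's configuration.

```python
import ssl


def make_tls_context():
    """Client-side TLS context that rejects anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # certificate and hostname checks stay on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

The returned context can be passed to standard library clients, for example as the context argument of urllib.request.urlopen.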

6. Reliability & Observability

RTO/RPO Definition: Defined recovery time objectives and recovery point objectives per workload.
Observability Enhancements: Integrated Prometheus for monitoring, Loki for log aggregation, and Grafana for visualisation.
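
To make the RTO/RPO idea concrete, the sketch below checks a workload against its objectives: the largest gap between consecutive backups bounds the data that could be lost (compared with the RPO), and the recovery time measured in a DR test is compared with the RTO. The class and function names, the workload name, and all figures are hypothetical, not Mukuru's actual targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    workload: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of data loss


def worst_case_data_loss(backup_times):
    """Largest gap between consecutive backups: the most data that could be
    lost if a failure hit just before the next backup completed."""
    ordered = sorted(backup_times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=timedelta(0))


def meets_objective(obj, backup_times, measured_recovery):
    """True when backup cadence satisfies the RPO and a DR test met the RTO."""
    return worst_case_data_loss(backup_times) <= obj.rpo and measured_recovery <= obj.rto


# Example: backups every 10 minutes against a 15-minute RPO and 1-hour RTO
objective = RecoveryObjective("payments-api", rto=timedelta(hours=1), rpo=timedelta(minutes=15))
start = datetime(2025, 1, 1)
backups = [start + timedelta(minutes=10 * i) for i in range(6)]
ok = meets_objective(objective, backups, measured_recovery=timedelta(minutes=40))
```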

7. Cost Optimisation

Cost Modelling & Review: Provided AWS cost models, identified right-sizing opportunities, and optimised deployment pipelines to reduce unnecessary costs.

The Results

1. Infrastructure Modernisation & Resilience

Improved Disaster Recovery (DR) Readiness: Multi-region feasibility assessments and Terraform-driven failover automation support rapid recovery when incidents occur.
Reduced Configuration Drift: Automated Terraform drift detection keeps infrastructure consistent and reduces manual intervention.

2. Cost Optimisation & Efficiency Gains

Reduced AWS Costs: CloudWatch-driven cost analysis led to better resource allocation, and the ECR cleanup process lowered storage costs by eliminating unnecessary container images.
Right-Sized Infrastructure: Optimisation of EKS clusters and autoscaling reduced cloud expenses while maintaining performance.

3. Performance & High Availability

Faster Query Performance: Implementing Memcached for Loki significantly improved observability data retrieval times.
High Availability Monitoring: Deployed Thanos for Prometheus, ensuring resilient and continuous monitoring across environments.
Scalable EKS Clusters: Auto-scaling enabled workloads to dynamically adjust based on demand.

4. Security & Governance Improvements

Stronger Identity & Access Management (IAM): Azure AD-based SSO enforced multi-factor authentication (MFA) and least-privilege role assumptions.
Improved Compliance & Auditability: All Terraform changes undergo peer review, and AWS CloudTrail logs provide traceability for API calls.

5. Operational Excellence & Deployment Efficiency

Faster & More Reliable Deployments: Standardised Terraform and CI/CD pipelines enabled predictable, low-risk deployments.
Reduced Deployment Failures: The GitOps workflow with Argo CD ensured all application deployments matched the "source of truth" in Git repositories.
Lower Mean Time to Resolution (MTTR): Unified observability across Prometheus, Loki, and Grafana allowed teams to detect and resolve issues faster.

Key Takeaways

1. Disaster Recovery should be automated and regularly tested to ensure business continuity.

2. Cloud Cost Management requires continuous monitoring and proactive right-sizing of resources.

3. IAM & Role-Based Access must be enforced to enhance security and prevent privilege misuse.

4. GitOps & Automated CI/CD Pipelines reduce manual errors and deployment risks.

5. Strong Observability (metrics, logs, alerts) accelerates troubleshooting and improves system reliability.

6. Cloud Governance should be standardised with automated policy enforcement and periodic

Tools Used

Automation & DevOps: Terraform, Argo CD, GitLab CI/CD
Security & Governance: Azure AD, AWS IAM, KMS, TLS Encryption
Observability Stack: Prometheus, Thanos, Loki, Grafana
AWS Services: EKS, CloudWatch, RDS, S3, Route 53, VPCs

About Mukuru

Industry: Digital Payments
Location: South Africa
Description: Mukuru is a leading financial services provider focused on remittances, payments, and digital banking solutions across Africa and emerging markets.