CASE STUDY: AWS Infrastructure Modernisation & Resilience Enhancement

Publish Date: 14/3/25

Overview

Mukuru partnered with Deimos to modernise its AWS infrastructure, enhancing resilience, cost efficiency, and high availability. Key improvements included automated disaster recovery, optimised cloud spend, EKS auto-scaling, and GitOps-driven deployments. This initiative strengthens Mukuru’s financial services with scalable, secure cloud operations.

The Challenge

Mukuru’s AWS infrastructure, which supports business-critical services, required modernisation to:

  • Optimise cloud resource utilisation
  • Enhance disaster recovery capabilities
  • Improve overall system performance

Without these improvements, Mukuru faced risks such as system downtime, inefficient resource allocation, and challenges in scaling efficiently under varying workloads.

To address these challenges, Mukuru partnered with Deimos to lead a comprehensive AWS infrastructure modernisation initiative, aimed at improving disaster recovery strategies, optimising AWS cloud costs, ensuring high availability for applications, and automating DevOps processes.

The key objectives:

  • Disaster Recovery (DR) Strategy: Enhancing multi-region readiness, implementing a comprehensive DR plan, and automating infrastructure provisioning using Terraform to ensure rapid failover during major incidents.
  • AWS Cost & Usage Optimisation: Identifying cost-saving opportunities through cloud resource analysis and optimising storage and resource allocation with Terraform improvements.
  • Application Performance & High Availability: Improving performance through better caching strategies, high availability setups for observability tools, and optimising EKS auto-scaling and node group configurations.
  • Infrastructure as Code & Automation: Introducing Infrastructure as Code (IaC) practices with Terraform to automate the management of cloud resources, detect and manage drift, and ensure consistent environments across development, QA, staging, and production.

The Solution

1. Disaster Recovery (DR) Strategy & Implementation

  • Multi-Region Readiness: Conducted DR feasibility assessments with cost-benefit analysis of different multi-region strategies.
  • Strategic DR Planning & Testing: Developed a disaster recovery roadmap, focusing on infrastructure resilience and failover strategies.
  • Infrastructure as Code (IaC) for DR: Automated DR infrastructure provisioning using Terraform.

2. AWS Cost & Usage Optimisation

  • Cloud Cost Analysis & Optimisation: Used CloudWatch for AWS cost assessments, identifying cost-saving opportunities and optimising resource allocation.
  • Efficient ECR Management: Designed a multi-threaded cleanup process for ECR images to reduce unnecessary storage costs.
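A multi-threaded cleanup of this kind could be sketched roughly as follows. The case study does not describe the actual retention policy or implementation, so the `Image` type, the "untagged and older than 30 days" rule, and the injected `delete` callable are all assumptions for illustration. In a real run, `delete` would wrap a registry call such as boto3's `batch_delete_image`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable


@dataclass(frozen=True)
class Image:
    """Hypothetical, simplified view of an ECR image."""
    repository: str
    digest: str
    pushed_at: datetime
    tagged: bool


def select_stale(images: Iterable[Image], now: datetime,
                 max_age_days: int = 30) -> list[Image]:
    """Pick untagged images pushed before the cutoff (an assumed policy)."""
    cutoff = now - timedelta(days=max_age_days)
    return [img for img in images if not img.tagged and img.pushed_at < cutoff]


def cleanup(images: Iterable[Image], delete: Callable[[Image], None],
            now: datetime, workers: int = 8) -> list[str]:
    """Delete stale images concurrently and return the digests submitted."""
    stale = select_stale(images, now)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every task and re-raises any worker exception.
        list(pool.map(delete, stale))
    return [img.digest for img in stale]
```

Passing the delete function in keeps the selection logic testable without AWS credentials, and threads suit this workload because each deletion is a network-bound API call.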

3. Application Performance & High Availability

  • Memcached for Loki: Improved Loki performance with caching to speed up query times.
  • Prometheus High Availability (HA): Deployed Thanos in production for HA Prometheus monitoring.
  • EKS Auto-Scaling & Node Group Optimisation: Implemented auto-scaling improvements for Mukuru’s EKS clusters.
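The case study does not say which autoscaling mechanisms were tuned. As one illustration of demand-based scaling on Kubernetes, the sketch below reproduces the replica calculation documented for the Horizontal Pod Autoscaler: desired replicas equal the current count multiplied by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance of 1. The metric values and replica bounds used here are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA-style calculation: ceil(current * observed / target), then clamp."""
    ratio = current_metric / target_metric
    # Skip scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

Node-level scaling then follows from this: when the scheduler cannot place the extra pods, a node autoscaler adds capacity to the node group.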

4. Terraform & DevOps Automation

  • Terraform Drift Detection: Automated infrastructure drift detection to improve consistency.
  • Continuous Integration & GitOps: Used GitLab CI for automated infrastructure testing and Argo CD for Kubernetes GitOps.
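One common way to automate drift detection is to run `terraform plan -detailed-exitcode` on a schedule and act on the exit code: 0 means no changes, 2 means the plan contains changes, and any other value indicates an error. The sketch below assumes that approach; the case study does not confirm how Deimos implemented it, and the `DriftStatus` names and `check_drift` helper are hypothetical.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> DriftStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return DriftStatus.IN_SYNC
    if code == 2:
        return DriftStatus.DRIFTED
    return DriftStatus.ERROR


def check_drift(workdir: str) -> DriftStatus:
    """Run a non-interactive plan in workdir (requires the terraform CLI)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)
```

A scheduled CI job could call `check_drift` for each environment directory and raise an alert whenever the result is `DRIFTED` on the main branch, where no code changes are pending.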

5. Security & Governance

  • Multi-Account AWS Structure: Managed via AWS Organizations for centralised security and cost control.
  • Identity & Access Management (IAM): Enforced Azure AD-based single sign-on (SSO) with role-based access.
  • Encryption Policy: Implemented encryption for data at rest (AWS KMS) and in transit (TLS 1.2+).

6. Reliability & Observability

  • RTO/RPO Definition: Defined recovery time objectives and recovery point objectives per workload.
  • Observability Enhancements: Integrated Prometheus for monitoring, Loki for log aggregation, and Grafana for visualisation.
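Per-workload RTO and RPO targets are easiest to enforce when they are recorded as data and compared against the results of DR drills. The sketch below shows one possible shape for that check. The type names, the workload name in the usage note, and the durations are hypothetical and do not come from the case study.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    workload: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of data loss


@dataclass(frozen=True)
class DrillResult:
    workload: str
    time_to_recover: timedelta
    data_loss_window: timedelta


def find_breaches(objectives: list[RecoveryObjective],
                  results: list[DrillResult]) -> list[str]:
    """Return one message per objective that a drill result failed to meet."""
    by_workload = {o.workload: o for o in objectives}
    breaches = []
    for result in results:
        objective = by_workload[result.workload]
        if result.time_to_recover > objective.rto:
            breaches.append(f"{result.workload}: RTO exceeded")
        if result.data_loss_window > objective.rpo:
            breaches.append(f"{result.workload}: RPO exceeded")
    return breaches
```

For example, a workload with a 30-minute RTO that took 45 minutes to recover in a drill would be reported as an RTO breach, while its RPO passes if the measured data-loss window stayed within the target.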

7. Cost Optimisation

  • Cost Modelling & Review: Provided AWS cost models, identified right-sizing opportunities, and optimised deployment pipelines to reduce unnecessary costs.

The Results

1. Infrastructure Modernisation & Resilience

  • Improved Disaster Recovery (DR) Readiness: Multi-region feasibility assessments and Terraform-driven failover automation ensure rapid recovery in case of incidents.
  • Reduced Configuration Drift: Automated Terraform drift detection keeps infrastructure consistent and reduces manual intervention.

2. Cost Optimisation & Efficiency Gains

  • Reduced AWS Costs: CloudWatch-driven cost analysis led to better resource allocation.
  • Lower ECR Storage Costs: The ECR cleanup process lowered storage costs by eliminating unnecessary container images.
  • Right-Sized Infrastructure: Optimisation of EKS clusters and autoscaling reduced cloud expenses while maintaining performance.

3. Performance & High Availability

  • Faster Query Performance: Implementing Memcached for Loki significantly improved observability data retrieval times.
  • High Availability Monitoring: Deployed Thanos for Prometheus, ensuring resilient and continuous monitoring across environments.
  • Scalable EKS Clusters: Auto-scaling enabled workloads to dynamically adjust based on demand.

4. Security & Governance Improvements

  • Stronger Identity & Access Management (IAM): Azure AD-based SSO enforced multi-factor authentication (MFA) and least-privilege role assumptions.
  • Improved Compliance & Auditability: All Terraform changes undergo peer review, and AWS CloudTrail logs ensure traceability for all API calls.

5. Operational Excellence & Deployment Efficiency

  • Faster & More Reliable Deployments: Standardised Terraform and CI/CD pipelines enabled predictable, low-risk deployments.
  • Reduced Deployment Failures: The GitOps workflow with Argo CD ensured all application deployments matched the "source of truth" in Git repositories.
  • Lower Mean Time to Resolution (MTTR): Unified observability across Prometheus, Loki, and Grafana allowed teams to detect and resolve issues faster.
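The GitOps idea of matching deployments to a source of truth comes down to comparing the desired state declared in Git with the live state in the cluster, then acting on the difference. The following is a conceptual sketch of that comparison only. It is not how Argo CD is implemented, and the resource names and dictionary shapes are hypothetical.

```python
def diff_state(desired: dict[str, dict], live: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired (Git) and live (cluster) resources keyed by name."""
    shared = desired.keys() & live.keys()
    return {
        # Declared in Git but missing from the cluster.
        "create": sorted(desired.keys() - live.keys()),
        # Present in both, but the live spec differs from Git.
        "update": sorted(name for name in shared if desired[name] != live[name]),
        # Running in the cluster but no longer declared in Git.
        "delete": sorted(live.keys() - desired.keys()),
    }


def in_sync(desired: dict[str, dict], live: dict[str, dict]) -> bool:
    """True when nothing needs to be created, updated, or deleted."""
    return not any(diff_state(desired, live).values())
```

A reconciler would run this comparison continuously and apply the resulting actions, so a manual change made in the cluster shows up as an "update" and is reverted to what Git declares.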

Key Takeaways

1. Disaster Recovery should be automated and regularly tested to ensure business continuity.

2. Cloud Cost Management requires continuous monitoring and proactive right-sizing of resources.

3. IAM & Role-Based Access must be enforced to enhance security and prevent privilege misuse.

4. GitOps & Automated CI/CD Pipelines reduce manual errors and deployment risks.

5. Strong Observability (metrics, logs, alerts) accelerates troubleshooting and improves system reliability.

6. Cloud Governance should be standardised with automated policy enforcement and periodic

Tools Used

  • Automation & DevOps: Terraform, Argo CD, GitLab CI/CD
  • Security & Governance: Azure AD, AWS IAM, KMS, TLS Encryption
  • Observability Stack: Prometheus, Thanos, Loki, Grafana
  • AWS Services: EKS, CloudWatch, RDS, S3, Route 53, VPCs

About Mukuru

Industry: Digital Payments
Location: South Africa
Description: Mukuru is a leading financial services provider focused on remittances, payments, and digital banking solutions across Africa and emerging markets.
