Disaster Recovery in Ceph with cephadm, Ceph-CSI, and RBD Mirror

Introduction

Ceph is a highly available, scalable, and resilient storage solution widely used in cloud and enterprise environments. However, even with its built-in redundancy, disaster recovery (DR) strategies are essential to ensure business continuity in case of data center failures, network outages, or hardware failures. Ceph provides robust disaster recovery options, including RBD mirroring, to replicate block storage volumes across geographically separated Ceph clusters.

With the introduction of cephadm, Ceph cluster management has become more straightforward, making it easier to deploy and maintain disaster recovery setups. Additionally, Ceph-CSI enables Kubernetes clusters to consume Ceph storage efficiently. In this article, we will explore how to set up disaster recovery in Ceph using cephadm, Ceph-CSI, and RBD Mirror to protect RBD volumes used by Kubernetes clusters deployed across two data centers.

Disaster Recovery Architecture

We have two geographically separated data centers:

  • Primary Data Center (Production): Hosts a Kubernetes cluster and a Ceph cluster.
  • Secondary Data Center (Disaster Recovery – DR): Hosts another Kubernetes cluster and a Ceph cluster where data is replicated using RBD mirroring.

Kubernetes workloads in the primary DC store their persistent data in Ceph RBD volumes via Ceph-CSI. These volumes are mirrored asynchronously to the secondary DC using RBD mirroring, ensuring data availability in case of a failure in the primary DC.

Deploying Ceph with cephadm in Both Data Centers

Bootstrap the Ceph Cluster

On each Ceph cluster (Primary and Secondary):

cephadm bootstrap --mon-ip <mon-ip>

Add Additional Nodes

cephadm shell -- ceph orch host add <hostname> <host-ip>

Deploy Required Services

ceph orch apply mon
ceph orch apply mgr
ceph orch apply osd --all-available-devices
ceph orch apply rbd-mirror
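
By default the orchestrator decides where to place these daemons. If you want to pin the rbd-mirror daemon to specific hosts, a placement specification can be passed; a hedged example (the host name is illustrative):

ceph orch apply rbd-mirror --placement="1 ceph-node-1"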

Ensure that the rbd-mirror daemon is running on both clusters:

ceph orch ps | grep rbd-mirror

Configure RBD Mirroring

On the primary Ceph cluster:

rbd mirror pool enable <pool-name> snapshot

Export and import the authentication key:

ceph auth get client.rbd-mirror > rbd-mirror.key
scp rbd-mirror.key <secondary-host>:
ssh <secondary-host> 'ceph auth import -i rbd-mirror.key'
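
Alternatively, recent Ceph releases offer a bootstrap-token workflow that replaces the manual key copy and peer add steps. A minimal sketch, assuming illustrative site names site-a and site-b:

# on the primary cluster: create a peer bootstrap token for the pool
rbd mirror pool peer bootstrap create --site-name site-a <pool-name> > peer-token
# on the secondary cluster: import the token
rbd mirror pool peer bootstrap import --site-name site-b <pool-name> peer-token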

On the secondary Ceph cluster, add a peer connection:

rbd mirror pool peer add <pool-name> client.rbd-mirror@<primary-cluster-name>

Verify peering status:

rbd mirror pool status <pool-name>
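
With snapshot-based mirroring, data is replicated when mirror snapshots are taken, so you will usually also want a snapshot schedule so that images are mirrored automatically. A hedged example (the 5-minute interval is illustrative):

rbd mirror snapshot schedule add --pool <pool-name> 5m
rbd mirror snapshot schedule ls --pool <pool-name>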

Installing Ceph-CSI in Kubernetes Clusters

Now that the Ceph clusters are ready, we can deploy Ceph-CSI on our Kubernetes clusters. Ceph-CSI needs to be deployed in both locations, with each deployment initially pointing to its local Ceph cluster.

Deploy Ceph-CSI Driver

kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/devel/deploy/csi-rbdplugin.yaml
kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/devel/deploy/csi-rbdplugin-provisioner.yaml 
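
Ceph-CSI also needs credentials for the Ceph cluster. The StorageClass below references a Secret named csi-rbd-secret; a minimal sketch of such a Secret, assuming a dedicated CephX user called kubernetes:

apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: default
stringData:
  userID: kubernetes       # CephX user used by the CSI driver (assumed name)
  userKey: <cephx-key>     # e.g. the output of: ceph auth get-key client.kubernetes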

Enable RBD Mirroring on the Pool

rbd mirror pool enable <pool-name> snapshot
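
Depending on the Ceph release and the pool's mirroring mode, snapshot-based mirroring may also have to be enabled on individual images. A hedged example for a single image:

rbd mirror image enable <pool-name>/<image-name> snapshot
rbd mirror image status <pool-name>/<image-name>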

Configure StorageClass to Use the Mirrored Pool

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-mirrored
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>
  pool: <mirrored-pool-name>
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
allowVolumeExpansion: true

Apply this StorageClass:

kubectl apply -f storageclass.yaml

Create a PersistentVolumeClaim (PVC)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd-mirrored
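
A workload then consumes the volume like any other PVC. A minimal illustrative Pod (the image name and mount path are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ceph-rbd-pvc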

Failover Process (Switching to the Secondary Data Center)

Promote the Secondary Ceph Cluster

rbd mirror pool promote <pool-name>
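
If the primary data center is unreachable (a real disaster rather than a planned switchover), the orderly demote/promote handshake is not possible and the promotion usually has to be forced:

rbd mirror pool promote --force <pool-name>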

Update ClusterID and PoolID Mappings

Ensure that the Kubernetes cluster in the DR site correctly maps the Ceph cluster’s ClusterID and PoolID using the predefined mapping.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-csi-config
data:
  cluster-mapping.json: |-
    [
      {
        "clusterIDMapping": {
          "primary-cluster-id": "secondary-cluster-id"
        },
        "RBDPoolIDMapping": [
          {
            "1": "2"
          },
          {
            "11": "12"
          }
        ]
      }
    ]
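
The keys and values in RBDPoolIDMapping are the numeric pool IDs on the primary and secondary clusters respectively; they can be looked up on each cluster, for example with:

ceph osd lspools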

Apply this updated mapping:

kubectl apply -f ceph-csi-config.yaml

Modify Ceph-CSI Config to Update Monitor Addresses on Secondary Cluster

To use a mirrored and promoted RBD image on the secondary site during a failover, you need to replace the primary monitor addresses with the IP addresses of the secondary cluster in ceph-csi-config. Otherwise, Ceph-CSI won’t be able to use the volumes, and application pods will get stuck in the ContainerCreating state. As a result, during failover both clusterID entries in the csi-config on the secondary site point to the same (secondary) monitor IP addresses.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-csi-config
data:
  config.json: |-
    [
      {
        "clusterID": "ceph1",
        "rbd": {
          "radosNamespace": ""
        },
        "monitors": [
          "192.168.39.82:6789"
        ],
        "cephFS": {
          "subvolumeGroup": ""
        }
      },
      {
        "clusterID": "ceph2",
        "rbd": {
          "radosNamespace": ""
        },
        "monitors": [
          "192.168.39.82:6789"
        ],
        "cephFS": {
          "subvolumeGroup": ""
        }
      }
    ]
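
The monitor addresses of the secondary cluster can be obtained directly from Ceph, for example:

ceph mon dump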

Apply the updated configuration:

kubectl apply -f ceph-csi-config.yaml

Modify StorageClass to Point to the Secondary Cluster

parameters:
  clusterID: secondary-cluster-id

Apply the modified StorageClass:

kubectl apply -f storageclass.yaml
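
Note that StorageClass parameters are immutable in Kubernetes, so re-applying a modified manifest over an existing StorageClass is typically rejected. One common approach is to delete and recreate it (existing PersistentVolumes are not affected):

kubectl delete storageclass ceph-rbd-mirrored
kubectl apply -f storageclass.yaml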

Restart Affected Workloads

kubectl rollout restart deployment <deployment-name>

Validate Data Accessibility

Ensure the applications can access data stored in the secondary Ceph cluster.
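
A quick sanity check is to confirm that the PVC is bound, that the application can read its data, and that the image now reports as primary on the DR cluster (pod, pool, and image names are illustrative):

kubectl get pvc ceph-rbd-pvc
kubectl exec -it <pod-name> -- ls /data
rbd mirror image status <pool-name>/<image-name>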

Failback Process (Restoring to the Primary Data Center)

Demote the Secondary Cluster and Re-enable Mirroring

rbd mirror pool demote <pool-name>
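
If the earlier failover was forced (the primary was down when the secondary was promoted), the images on the old primary are usually left in a split-brain state and must be resynchronized before mirroring works again. A hedged sketch of the extra steps, run on the original primary cluster (image names are illustrative):

# discard the stale local copy and resync from the current primary (per image)
rbd mirror image demote <pool-name>/<image-name>
rbd mirror image resync <pool-name>/<image-name>
# once the images are back in sync and the secondary pool has been demoted,
# promote the pool on the original primary again
rbd mirror pool promote <pool-name>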

Update ClusterID and PoolID Mappings Back to Primary

Revert the cluster-mapping.json entry in the ceph-csi-config ConfigMap to its original contents so that Ceph-CSI resolves volumes against the primary cluster again, then re-apply the ConfigMap as in the failover step.

Modify StorageClass to Point Back to the Primary Cluster

parameters:
  clusterID: primary-cluster-id

Restart Workloads to Use the Primary Cluster

kubectl rollout restart deployment <deployment-name>

Verify Mirroring and Data Integrity

rbd mirror pool status <pool-name>

Conclusion

By configuring ClusterID and PoolID mappings and ensuring proper Ceph monitor address updates during failover, you enable seamless disaster recovery for Kubernetes workloads using Ceph-CSI. This approach maintains data accessibility and consistency, facilitating a smoother failover and failback process. Using cephadm, deploying and managing mirroring has become significantly easier, enabling organizations to set up failover mechanisms efficiently. By following the above steps, you can ensure data integrity, minimize downtime, and enhance business continuity in the event of a disaster.

Author

Kamil Madáč
Grow2FIT Cloud&DevOps Consultant

Kamil is a Senior Cloud / Infrastructure consultant with 20+ years of experience and strong know-how in designing, implementing, and administering private cloud solutions (primarily built on OpenSource solutions such as OpenStack). He has many years of experience with application development in Python and currently also with development in Go. Kamil has substantial know-how in SDS (Software-defined storage), SDN (Software-defined networking), Data storage (Ceph, NetApp), administration of Linux servers and operation of deployed solutions.
Kamil regularly contributes to OpenSource projects (OpenStack, Kuryr, Requests Lib – Python).

The entire Grow2FIT consulting team: Our Team

Enhancing DevOps Excellence: Our Collaboration with ČSOB CZ

🇨🇿 Greetings from Prague

We’re on-site with CSOB, diving deep into the assessment of their DevOps capabilities. Our focus? Enhancing practices in these critical areas:

🔧 GitOps – Streamlining configuration, versioning, and automated deployments
🌐 Service Mesh – Optimizing service discovery, traffic management, and security integration
📊 Observability – Building robust monitoring, logging, and alerting systems for long-term insights

Excited to collaborate with the talented ČSOB DevOps team to align their practices with cutting-edge industry standards while respecting the unique needs of the banking sector. 🚀

Check our DevOps services here

🎉 Welcome Jana Revajová to our team! 🚀

Jana is a skilled Solution Architect and Project Manager with extensive experience in designing and delivering IT solutions in the financial services industry.

She specializes in loans, risk management, payments, cards, and ATM & POS processing, combining modern technologies with legacy systems to deliver innovative and scalable solutions.

Her contributions include designing and implementing a cloud-based banking platform, streamlining operations, and driving customer satisfaction.

With certifications in TOGAF, ITIL, and Prince2, Jana brings expertise in solution design and cross-functional collaboration.

Check our team here

Case Study: Solargis – Ceph Design & Consultancy

For Solargis, a leading solar data and analytics provider, we designed and implemented a Ceph cluster to meet their increasing data storage and performance demands. The project involved replacing their existing NFS server setup with a scalable, high-performance, and cost-effective solution. In addition to the initial implementation, we provided ongoing consultancy to ensure their infrastructure operated optimally and supported their evolving business needs.

Project Highlights

  • Ceph Cluster Design & Implementation: We designed and deployed a Ceph cluster tailored to Solargis’ specific workload requirements.
  • Performance Optimization: Delivered tuning recommendations for CephFS performance, including cache and striping configurations.
  • Advanced Feature Integration: Implemented S3-compatible object storage and seamless integration with Kubernetes.
  • Consultancy & Training: Conducted regular workshops and Q&A sessions on best practices and advanced configurations.
  • Issue Resolution: Supported Solargis in troubleshooting and resolving performance bottlenecks and misconfigurations.

Benefits for Solargis

The Ceph solution delivered numerous benefits for Solargis:

  • High Performance & Scalability: Ceph provided the robust performance needed for their intensive data workloads while offering seamless scalability as their storage needs grew.
  • Enhanced Functionality: With features like S3 object storage and Kubernetes integration, Solargis unlocked new capabilities that improved operations and supported containerized workloads.
  • Cost Efficiency: Ceph allowed Solargis to avoid the high costs associated with proprietary storage systems, offering a robust solution at a fraction of the price.
  • Operational Flexibility: Ceph’s flexible architecture enabled the customization of storage solutions, including multi-zone replication and advanced file system configurations.
  • Future-Proof Infrastructure: Solargis gained a modern, reliable storage platform that continues to evolve with their business requirements.

Client Statement

“The Grow2Fit team helped us implement Ceph smoothly and effectively. They not only designed and implemented a solution that met our expectations but also provided ongoing consultancy that ensured we could fully leverage Ceph’s advanced features.

Their hands-on approach, in-depth workshops, and rapid troubleshooting support gave us the confidence to push the boundaries of what Ceph can do for us.”

Miroslav Moravčík

Our work ensured a smooth transition to Ceph, resolved complex challenges, and empowered Solargis to maximize the value of their investment.

Key Technologies

  • Ceph

New Project for Kvapay: Crypto Payment Solutions Infrastructure

We are excited to announce a new collaboration with Kvapay to enhance their infrastructure for Crypto Payment Solutions. Kvapay, a leading provider of cryptocurrency payment services, offers secure and efficient solutions for businesses and individuals to manage their digital assets. With a growing network of Kvakomat cryptocurrency ATMs and a comprehensive crypto wallet, they are driving innovation in the digital payments industry.

This partnership aims to deliver a robust, scalable, and secure environment to support Kvapay’s innovative financial services.

Project Highlights

  • GitOps Implementation
  • Kubernetes Clusters
  • Support Infrastructure (Monitoring, Alerting, Tracking)
  • Monitoring & Observability
  • Enhanced Security and Reliability

Benefits for Kvapay
This project will equip Kvapay with a cutting-edge infrastructure that:

  • Enhances system reliability and performance
  • Streamlines deployment processes
  • Provides real-time insights for proactive management
  • Ensures data security and business continuity

We proudly support Kvapay’s mission to revolutionize crypto payment solutions. Stay tuned for more updates as we move forward!

🌞 Goodbye Summer Party at Skalka near Kremnica! 🗻

We stepped out of our comfort zones and tackled the beautiful Via Ferrata tracks together. As a service company delivering tailored solutions with diverse teams across different clients, it’s always a treat to come together as one team. This adventure was a fantastic way to bond, recharge, and reflect on our accomplishments last year. 💪🤝

Case Study: Deutsche Telekom – Open Sovereign Cloud

Deutsche Telekom, a leading telecommunications and IT services provider in Europe, embarked on an ambitious project to develop a sovereign cloud platform. The aim was to create a secure, compliant, and highly interoperable cloud solution using open-source technologies. This case study outlines the motivations, architecture, and innovative aspects of this project, showcasing its potential benefits for developers and businesses alike.

Motivation and Objectives

The project was driven by two primary principles: openness and sovereignty. By leveraging open-source components, the platform ensures transparency and flexibility. Sovereignty is achieved by adhering to the guidelines set by the GAIA-X initiative, which promotes data and operational sovereignty within the European Union. This ensures compliance with EU laws, providing users with freedom of choice and interoperability across multiple cloud providers.

Key Features and Architecture

The cloud platform is structured into three main layers, each with its unique features and capabilities:

  1. Infrastructure Layer:
    • MetalStack Technology: The infrastructure is based on a modern, Kubernetes-native technology called MetalStack. This offers essential infrastructure services like compute resources (virtual machines), storage (using Ceph), and networking (based on SONiC).
    • Kubernetes Integration: MetalStack leverages Kubernetes for resource management, providing a cloud-native, scalable, and efficient infrastructure solution.
  2. Platform as a Service (PaaS):
    • Gardener: This orchestration tool manages Kubernetes clusters, allowing for seamless integration with various infrastructures. It supports multiple Kubernetes versions and offers geo-redundancy through its garden, seed, and shoot cluster architecture.
    • Automated Management: Users can easily create and manage Kubernetes clusters via a user-friendly dashboard or APIs, supporting CI/CD pipelines for automated deployments.
  3. Software as a Service (SaaS):
    • Kyma Runtime: Kyma enhances Kubernetes with additional tools for serverless functions, API gateway, service mesh (Istio), and observability (Prometheus, Grafana, Loki, Jaeger).
    • Service Catalog: A comprehensive catalog of ready-made services like PostgreSQL, Kafka, Redis, and more, allowing developers to build applications quickly using these pre-configured components.

Innovation and Security

One of the most innovative aspects of the platform is its support for confidential computing. This technology addresses the challenge of securing in-memory data by encrypting the entire memory context of running containers. Leveraging Intel’s SGX technology, the platform ensures that even memory snapshots remain encrypted, preventing unauthorized access to sensitive data. This level of security makes the platform suitable for high-stakes applications in sectors like healthcare and defense.

Development Process and Team Culture

The development of this platform follows agile methodologies, with cross-functional teams working collaboratively across different layers of the stack.

Key technologies and tools used include:

  • Programming Languages: Go, shell scripting, C (for network acceleration), and Python (for testing).
  • Operating Systems: A customized Debian-based Linux distribution called Garden Linux.
  • Development Tools: Git and GitLab for version control, task management, and CI/CD pipelines.

The team’s culture emphasizes transparency, collaboration, and continuous improvement, with regular sprint reviews and quarterly face-to-face meetings to align on priorities and address challenges.

Conclusion

The open-source and sovereign cloud platform developed by our client represents a significant advancement in cloud technology, combining compliance, security, and interoperability. By adhering to GAIA-X principles and leveraging cutting-edge technologies, the platform offers a robust solution for businesses seeking a secure and flexible cloud environment. This project not only sets a new standard for cloud services in Europe but also provides a model for future innovations in the industry.

Key Technologies

  • Kubernetes
  • MetalStack
  • Ceph
  • Gardener
  • Kyma
  • Go
  • GitLab

Case study: SoftPoint – Enhancing Infrastructure and Deployment Efficiency

Streamlining Processes, Improving Scalability, and Reducing Costs through Comprehensive Technical Solutions

Overview

Softpoint sought assistance with system infrastructure, monitoring, integration, and deployment processes. We conducted a comprehensive analysis of key areas to effectively address their needs.

Analysis Areas

  • Infrastructure: Reviewed and optimized Kubernetes, virtual machines, and PostgreSQL setups.
  • Monitoring: Developed dashboards to identify performance bottlenecks.
  • Resource Limitation: Implemented tenant-based resource limits.
  • Auto-Deployment and GitLab CI: Streamlined deployment processes.
  • Cost Analysis: Identified opportunities for cost savings.
  • Auto-Scaling Pods: Planned for future scalability.

Implementation

We integrated auto-deployment scripts with GitLab CI, addressed pipeline issues, and enhanced deployment processes. The infrastructure was upgraded, including Kubernetes and PostgreSQL tweaks, and new instance pools were configured for cost efficiency.

Infrastructure Changes

  • Upgraded Kubernetes and optimized worker configurations.
  • Implemented cost-saving measures, reducing expenses by hundreds of EUR per month.

Additional Improvements

  • Enabled security features like WAF and session stickiness.
  • Optimized PostgreSQL settings and addressed memory management issues.

Outcome

The collaboration resulted in streamlined automated deployment, improved operational efficiency, scalability, and cost savings. Our partnership with Softpoint led to infrastructure and process improvements, setting the stage for future growth and scalability.

Contact Person

Peter Jakubík, CEO SoftPoint

Key Technologies

  • MS Azure
  • Kubernetes
  • WAF
  • PostgreSQL
  • GitLab

Kubernetes Days Prague 2024: Gabriel Illés on Observability with OpenTelemetry

🌟 Just back from Kubernetes Days Prague 2024 where our Senior DevOps Engineer, Gabriel Illés, presented on “Observability with OpenTelemetry Collector in distributed cloud and edge computing.” He discussed the challenges and strategies for implementing observability in complex environments, using OpenTelemetry Collector.

For those interested in diving deeper, the presentation is available here:

Welcome Aboard: Dominika Pénzeš Joins Our Management Team to Lead IT Specialist Sourcing

Disrupting our clients’ technology landscapes is a complex challenge — that’s why it’s crucial to have the right team. We are thrilled to announce that Dominika Pénzeš is joining our management team, where she will lead our IT specialist sourcing service. Having known Dominika for many years, we are confident in her expertise and excited about the new perspectives she will bring. Please join us in wishing Dominika great success in her new role.

Check our team here