How to deploy CEPH storage pools with Erasure Coding on a Proxmox cluster

The end result

  • 2 pools with 4+2 erasure coding: slow nearline storage on spinning rust and fast solid-state storage for model training. Each pool gets its own CEPH FS (this could be done with a single CEPH FS using different namespaces and usernames, but I prefer to keep them separate for simplicity).

Why CEPH?

CEPH fits the requirements:

  • open-source, enterprise storage solution that is currently maintained
  • a single file system that can be mounted on multiple servers directly
  • in my experience, it is more suitable for heavy loads than, say, NFS
  • drives across multiple nodes pooled together to combine and effectively utilize the disk space
  • redundancy to prevent data loss in case of disk drive malfunction
  • ability to seamlessly do the following without causing data loss, downtime, or the need to recreate the filesystem or pool:
    • replace faulty drives
    • replace old drives with new ones of bigger capacity
    • remove drives
    • indefinitely expand storage capacity by adding new drives of any size
    • move a disk drive to another node
    • add new nodes to the cluster
  • no vendor lock-in to a particular disk drive adapter or motherboard
  • effortlessly assign spinning drives to nearline storage and solid state drives to model training
  • flexibility of having different types of pools with different redundancy levels using the same disk drives

Prerequisites

  • A running Proxmox VE 8.2 cluster with at least 3 nodes
  • Ceph 18.2.4 Reef installed on all nodes and initialized, with one monitor and one manager per node; the Ceph dashboard installed and enabled on all nodes
  • At least 6 disk drives per pool for 4+2 erasure coding
    • HBAs flashed to IT mode
    • spinning rust: same size SATA/SAS CMR enterprise-grade drives
    • solid state: same size U.2 NVMe enterprise-grade SSDs, or at the very least consumer-grade SATA drives with PLP. PLP is extremely important: drives without it (read: most consumer-grade drives) may power-reset themselves under heavy load, causing OSDs to fail
  • Ubuntu or Debian VMs and LXCs that will use the CEPH FS.
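
Before moving on, it is worth sanity-checking the prerequisites above from any node: all daemons on the same Ceph release, a healthy cluster, and OSDs reporting the expected device classes. A minimal check (plain Ceph CLI, nothing specific to this setup):

    # all daemons should report the same release (18.2.4 reef)
    ceph versions
    # overall health, monitor/manager quorum
    ceph -s
    # OSD layout per node; the CLASS column should show hdd or ssd as expected
    ceph osd tree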

Create pools

  • create MDSs on all nodes, giving them unique names
    • log in to each node and create two MDSs per node, e.g. pveceph mds create --name first
  • change the .mgr pool's CRUSH rule from replicated_rule to replicated_ssd, to both change the failure domain to OSD and move the pool to SSD. With the failure domain set to host, the rule cannot resolve itself if fewer than three nodes in the cluster have OSDs. Moreover, the default rule has no device class set; since we must set a device class for all other pools later, keeping it would stop PG autoscaling from working on all pools and leave Ceph MDSs stuck in the creating state.
    • ceph osd crush rule create-replicated replicated_ssd default osd ssd
    • ceph osd crush rule dump replicated_ssd
    • ceph osd pool set .mgr crush_rule replicated_ssd
    • ceph osd crush rule rm replicated_rule
  • create SSD pool:
    pveceph pool create erasure42ssd --erasure-coding k=4,m=2,failure-domain=osd,device-class=ssd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
    ceph osd pool set erasure42ssd-data allow_ec_overwrites true
    ceph osd pool set erasure42ssd-data bulk true
    ceph osd pool application enable erasure42ssd-data cephfs --yes-i-really-mean-it
    ceph osd pool application disable erasure42ssd-data rbd --yes-i-really-mean-it
    ceph fs new cephfsssd erasure42ssd-metadata erasure42ssd-data --force
    
  • create HDD pool:
    pveceph pool create erasure42hdd --erasure-coding k=4,m=2,failure-domain=osd,device-class=hdd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
    ceph osd pool set erasure42hdd-data allow_ec_overwrites true
    ceph osd pool set erasure42hdd-data bulk true
    ceph osd pool application enable erasure42hdd-data cephfs --yes-i-really-mean-it
    ceph osd pool application disable erasure42hdd-data rbd --yes-i-really-mean-it
    ceph fs new cephfshdd erasure42hdd-metadata erasure42hdd-data --force
    
  • check with ceph fs ls; it should output the following (a fuller verification sketch follows after this list)
    name: cephfsssd, metadata pool: erasure42ssd-metadata, data pools: [erasure42ssd-data ]
    name: cephfshdd, metadata pool: erasure42hdd-metadata, data pools: [erasure42hdd-data ]
    
  • set all metadata pools to use replicated_ssd CRUSH rule
    ceph osd pool set erasure42ssd-metadata crush_rule replicated_ssd
    ceph osd pool set erasure42hdd-metadata crush_rule replicated_ssd
    
  • change the CEPH FS maximum file size limit from the default 1 TB to 5 TB for each filesystem
    • first check the current limit; it should be 1099511627776 (1 TiB)
        ceph fs get cephfsssd | grep max_file_size
        ...
        
    • then set the new limit to 5497558138880 bytes (5 × 1099511627776, i.e. 5 TiB)
        ceph fs set cephfsssd max_file_size 5497558138880
        ...
        
  • (optional) add the pools as storage to the cluster using the GUI (Datacenter -> Storage); name them cfssd and cfhdd respectively

  • create clients and keyrings for each CEPH FS; restrict them to paths for flexibility
    sudo ceph auth get-or-create client.userssd mon 'allow r' mds 'allow r path=/, allow rwps path=/userssd' osd 'allow rw pool=erasure42ssd-data' -o /etc/ceph/ceph.client.userssd.keyring
    ...
    
  • create directories for the clients
    mkdir /mnt/pve/cfssd/userssd
    ...
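
With the pools, filesystems and clients in place, it is worth a quick verification pass from any PVE node before mounting anything. A minimal sketch (the erasure-code profile name is whatever pveceph generated for each pool, so list the profiles rather than assuming a name):

    # list erasure-code profiles backing the EC pools
    ceph osd erasure-code-profile ls
    # per-pool settings: crush_rule, allow_ec_overwrites, bulk and application should match the steps above
    ceph osd pool ls detail
    # both filesystems should be up with active MDSs and standbys available
    ceph fs status
    # capabilities of the clients created above
    ceph auth get client.userssd
    ceph auth get client.userhdd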
    

Mount CEPH FS in an LXC

LXCs must use CEPH FS through bind mounts from the PVE host node. Do the following on the PVE host node:

  • create directories for the bind mounts for each filesystem
    mkdir -p /mnt/bindmounts/all_hdd
    ...
    
  • edit the host's fstab (nano /etc/fstab) and add the following line for each mount point (there can be multiple mount points per pool)
    192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd      /mnt/bindmounts/all_ssd   ceph    mds_namespace=cephfsssd,name=userssd,secretfile=/etc/pve/priv/ceph.client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208    0       2
    ...
    

    etc

  • if the bind mounts fail to mount after reboot:
    • nano /etc/systemd/system/manualfstab.service
      • add the following lines
        [Unit]
        Description=Mount host fstab manually
        After=network.target
        
        [Service]
        Type=idle
        User=root
        ExecStart=/bin/sh -c "mount -a"
        Restart=on-failure
        
        [Install]
        WantedBy=multi-user.target
        
    • enable and start the service:
      systemctl enable manualfstab.service
      systemctl start manualfstab.service
      
  • check the container config for existing mount points and increment the number for new ones (a quick verification sketch follows after this list)
    • cat /etc/pve/lxc/100.conf | grep mp
    • pct set 100 -mp3 /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
    • or edit the config file directly (nano /etc/pve/lxc/100.conf):
        mp3: /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
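
To confirm the container actually sees the CEPH FS, a quick check (assuming the example container ID 100 and the paths used in this post; adding a mount point generally requires a container restart to take effect):

    # verify the bind mount on the PVE host
    findmnt /mnt/bindmounts/all_hdd
    # restart the container so it picks up the new mount point, then check from inside it
    pct reboot 100
    pct exec 100 -- df -h /mnt/all_hdd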
        

Mount CEPH FS in a VM (or any other server)

  • sudo apt install ceph-common

  • copy the keyrings from the PVE host to the VM
      scp root@192.168.100.206:/etc/ceph/ceph.client.userssd.keyring /etc/ceph/ceph.client.userssd.keyring
      ...
      
  • copy ceph config
      scp root@192.168.100.206:/etc/pve/ceph.conf /etc/ceph/ceph.conf
      
    • sudo nano /etc/ceph/ceph.conf, remove all blocks except for [global], add the following lines
        [client.userssd]
        keyring = /etc/ceph/ceph.client.userssd.keyring
        [client.userhdd]
        keyring = /etc/ceph/ceph.client.userhdd.keyring
        
  • sudo nano /etc/fstab, add lines for each mount point
      192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd /mnt/all_ssd   ceph    mds_namespace=cephfsssd,name=userssd,secretfile=/etc/ceph/ceph.client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208    0       2
      ...
      
  • naturally, mkdir -p /mnt/all_ssd etc

  • sudo mount -a

  • check with mount and df -h
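
As a final smoke test, write a file to each mount and confirm CephFS placed it in the expected data pool via the layout xattr (file names below are just examples; getfattr comes from the attr package):

    # write a small test file to the SSD-backed mount
    dd if=/dev/zero of=/mnt/all_ssd/write_test bs=1M count=64 oflag=direct
    # the layout xattr should reference the erasure42ssd-data pool
    getfattr -n ceph.file.layout /mnt/all_ssd/write_test
    rm /mnt/all_ssd/write_test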

References

  • https://docs.ceph.com/en/reef/
  • https://knowledgebase.45drives.com/kb/creating-client-keyrings-for-cephfs/
  • https://knowledgebase.45drives.com/kb/create-ec-profile/
  • https://www.ibm.com/docs/en/storage-ceph/6?topic=size-specifying-target-using-total-cluster-capacity
  • https://forum.proxmox.com/threads/best-practice-for-mounting-cephfs-for-both-proxmox-storage-and-lxc-bind-mount.135146/
This post is licensed under CC BY 4.0 by the author.