How to deploy CEPH storage pools with Erasure Coding on a Proxmox cluster

The end result

  • 2 pools with 4+2 erasure coding: slow nearline storage on spinning rust and fast solid-state storage for model training. Each pool gets its own CEPH FS (this could be done with a single CEPH FS using different namespaces and usernames, but I prefer to keep them separate for simplicity).

Why CEPH?

CEPH fits the requirements:

  • open-source, enterprise storage solution that is currently maintained
  • a single file system that can be mounted on multiple servers directly
  • in my experience, it is more suitable for heavy loads than, say, NFS
  • drives across multiple nodes pooled together to combine and effectively utilize the disk space
  • redundancy to prevent data loss in case of disk drive malfunction
  • ability to seamlessly do the following without causing data loss, downtime, or the need to recreate the filesystem or pool:
    • replace faulty drives
    • replace old drives with new ones of bigger capacity
    • remove drives
    • indefinitely expand storage capacity by adding new drives of any size
    • move a disk drive to another node
    • add new nodes to the cluster
  • no vendor lock-in to a particular disk drive adapter or motherboard
  • effortlessly assign spinning drives to nearline storage and solid state drives to model training
  • flexibility of having different types of pools with different redundancy levels using the same disk drives

Prerequisites

  • A running Proxmox VE 8.2 cluster with at least 3 nodes
  • Ceph 18.2.4 Reef installed on all nodes and initialized, with one monitor and one manager per node; the Ceph dashboard installed and enabled on all nodes
  • At least 6 disk drives per pool for 4+2 erasure coding
    • HBAs flashed to IT mode
    • spinning rust: same size SATA/SAS CMR enterprise-grade drives
    • solid state: same size U.2 NVMe enterprise-grade SSDs, or at the very least consumer-grade SATA drives with PLP. PLP is extremely important: drives without it (read: most consumer-grade drives) may power-reset themselves under heavy load, causing OSDs to fail
  • Ubuntu or Debian VMs and LXCs that will use the CEPH FS.
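
Before moving on, it is worth sanity-checking the prerequisites above from any node: all daemons on the same Ceph release, a healthy cluster, and OSDs reporting the expected device classes. A minimal check (plain Ceph CLI, nothing specific to this setup):

    # all daemons should report the same release (18.2.4 reef)
    ceph versions
    # overall health, monitor/manager quorum
    ceph -s
    # OSD layout per node; the CLASS column should show hdd or ssd as expected
    ceph osd tree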

Create pools

  • create MDSs on all nodes, giving them unique names
    • log in to each node and create two MDSs per node, e.g. pveceph mds create --name first
  • change the .mgr pool's CRUSH rule from replicated_rule to replicated_ssd, to both change the failure domain to OSD and move the pool to SSD. With the failure domain set to host, the rule cannot resolve itself if fewer than three nodes in the cluster have OSDs. Moreover, the default rule has no device class set; since we must set a device class for all other pools later, keeping it would stop PG autoscaling from working on all pools and leave Ceph MDSs stuck in the creating state.
    • ceph osd crush rule create-replicated replicated_ssd default osd ssd
    • ceph osd crush rule dump replicated_ssd
    • ceph osd pool set .mgr crush_rule replicated_ssd
    • ceph osd crush rule rm replicated_rule
  • create SSD pool:
    pveceph pool create erasure42ssd --erasure-coding k=4,m=2,failure-domain=osd,device-class=ssd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
    ceph osd pool set erasure42ssd-data allow_ec_overwrites true
    ceph osd pool set erasure42ssd-data bulk true
    ceph osd pool application enable erasure42ssd-data cephfs --yes-i-really-mean-it
    ceph osd pool application disable erasure42ssd-data rbd --yes-i-really-mean-it
    ceph fs new cephfsssd erasure42ssd-metadata erasure42ssd-data --force
    
  • create HDD pool:
    pveceph pool create erasure42hdd --erasure-coding k=4,m=2,failure-domain=osd,device-class=hdd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
    ceph osd pool set erasure42hdd-data allow_ec_overwrites true
    ceph osd pool set erasure42hdd-data bulk true
    ceph osd pool application enable erasure42hdd-data cephfs --yes-i-really-mean-it
    ceph osd pool application disable erasure42hdd-data rbd --yes-i-really-mean-it
    ceph fs new cephfshdd erasure42hdd-metadata erasure42hdd-data --force
    
  • check with ceph fs ls; it should output the following (a fuller verification sketch follows after this list)
    name: cephfsssd, metadata pool: erasure42ssd-metadata, data pools: [erasure42ssd-data ]
    name: cephfshdd, metadata pool: erasure42hdd-metadata, data pools: [erasure42hdd-data ]
    
  • set all metadata pools to use replicated_ssd CRUSH rule
    ceph osd pool set erasure42ssd-metadata crush_rule replicated_ssd
    ceph osd pool set erasure42hdd-metadata crush_rule replicated_ssd
    
  • change the CEPH FS maximum file size limit from the default 1 TB to 5 TB for each filesystem
    • first check the current limit; it should be 1099511627776 (1 TiB)
        ceph fs get cephfsssd | grep max_file_size
        ...
        
    • then set the new limit to 5497558138880 bytes (5 × 1099511627776, i.e. 5 TiB)
        ceph fs set cephfsssd max_file_size 5497558138880
        ...
        
  • (optional) add the pools as storage to the cluster using the GUI (Datacenter -> Storage); name them cfssd and cfhdd respectively

  • create clients and keyrings for each CEPH FS; restrict them to paths for flexibility
    sudo ceph auth get-or-create client.userssd mon 'allow r' mds 'allow r path=/, allow rwps path=/userssd' osd 'allow rw pool=erasure42ssd-data' -o /etc/ceph/ceph.client.userssd.keyring
    ...
    
  • create directories for the clients
    mkdir /mnt/pve/cfssd/userssd
    ...
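
With the pools, filesystems and clients in place, it is worth a quick verification pass from any PVE node before mounting anything. A minimal sketch (the erasure-code profile name is whatever pveceph generated for each pool, so list the profiles rather than assuming a name):

    # list erasure-code profiles backing the EC pools
    ceph osd erasure-code-profile ls
    # per-pool settings: crush_rule, allow_ec_overwrites, bulk and application should match the steps above
    ceph osd pool ls detail
    # both filesystems should be up with active MDSs and standbys available
    ceph fs status
    # capabilities of the clients created above
    ceph auth get client.userssd
    ceph auth get client.userhdd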
    

Mount CEPH FS in an LXC

LXCs must use CEPH FS through bind mounts from the PVE host node. Do the following on the PVE host node:

  • create directories for the bind mounts for each filesystem
    mkdir -p /mnt/bindmounts/all_hdd
    ...
    
  • edit the host's fstab (nano /etc/fstab) and add the following line for each mount point (there can be multiple mount points per pool)
    192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd      /mnt/bindmounts/all_ssd   ceph    mds_namespace=cephfsssd,name=userssd,secretfile=/etc/pve/priv/ceph.client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208    0       2
    ...
    

    etc

  • if the bind mounts fail to mount after reboot:
    • nano /etc/systemd/system/manualfstab.service
      • add the following lines
        [Unit]
        Description=Mount host fstab manually
        After=network.target
        
        [Service]
        Type=idle
        User=root
        ExecStart=/bin/sh -c "mount -a"
        Restart=on-failure
        
        [Install]
        WantedBy=multi-user.target
        
    • enable and start the service:
      systemctl enable manualfstab.service
      systemctl start manualfstab.service
      
  • check the container config for existing mount points and increment the number for new ones (a quick verification sketch follows after this list)
    • cat /etc/pve/lxc/100.conf | grep mp
    • pct set 100 -mp3 /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
    • or edit the config file directly (nano /etc/pve/lxc/100.conf):
        mp3: /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
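
To confirm the container actually sees the CEPH FS, a quick check (assuming the example container ID 100 and the paths used in this post; adding a mount point generally requires a container restart to take effect):

    # verify the bind mount on the PVE host
    findmnt /mnt/bindmounts/all_hdd
    # restart the container so it picks up the new mount point, then check from inside it
    pct reboot 100
    pct exec 100 -- df -h /mnt/all_hdd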
        

Mount CEPH FS in a VM (or any other server)

  • sudo apt install ceph-common

  • copy the keyrings from the PVE host to the VM
      scp root@192.168.100.206:/etc/ceph/ceph.client.userssd.keyring /etc/ceph/ceph.client.userssd.keyring
      ...
      
  • copy ceph config
      scp root@192.168.100.206:/etc/pve/ceph.conf /etc/ceph/ceph.conf
      
    • sudo nano /etc/ceph/ceph.conf, remove all blocks except for [global], add the following lines
        [client.userssd]
        keyring = /etc/ceph/ceph.client.userssd.keyring
        [client.userhdd]
        keyring = /etc/ceph/ceph.client.userhdd.keyring
        
  • sudo nano /etc/fstab, add lines for each mount point
      192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd /mnt/all_ssd   ceph    mds_namespace=cephfsssd,name=userssd,secretfile=/etc/ceph/ceph.client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208    0       2
      ...
      
  • naturally, mkdir -p /mnt/all_ssd etc

  • sudo mount -a

  • check with mount and df -h
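
As a final smoke test, write a file to each mount and confirm CephFS placed it in the expected data pool via the layout xattr (file names below are just examples; getfattr comes from the attr package):

    # write a small test file to the SSD-backed mount
    dd if=/dev/zero of=/mnt/all_ssd/write_test bs=1M count=64 oflag=direct
    # the layout xattr should reference the erasure42ssd-data pool
    getfattr -n ceph.file.layout /mnt/all_ssd/write_test
    rm /mnt/all_ssd/write_test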

References

  • https://docs.ceph.com/en/reef/
  • https://knowledgebase.45drives.com/kb/creating-client-keyrings-for-cephfs/
  • https://knowledgebase.45drives.com/kb/create-ec-profile/
  • https://www.ibm.com/docs/en/storage-ceph/6?topic=size-specifying-target-using-total-cluster-capacity
  • https://forum.proxmox.com/threads/best-practice-for-mounting-cephfs-for-both-proxmox-storage-and-lxc-bind-mount.135146/
This post is licensed under CC BY 4.0 by the author.