How to deploy CEPH storage pools with Erasure Coding on a Proxmox cluster
The end result
- Two pools with 4+2 erasure coding: slow nearline storage on spinning rust and fast solid-state storage for model training. Each pool has its own CEPH FS (this could be done with a single cephfs using different namespaces and usernames, but I prefer to keep them separate for simplicity).
Why CEPH?
CEPH fits the requirements:
- an open-source, actively maintained enterprise storage solution
- a single file system that can be mounted on multiple servers directly
- in my experience, it handles heavy loads better than, say, NFS
- drives across multiple nodes are pooled together, so disk space is combined and used efficiently
- redundancy to prevent data loss in case of disk drive malfunction
- ability to seamlessly do the following without data loss, downtime, or the need to recreate the filesystem or pool (a sketch of a drive replacement follows this list):
- replace faulty drives
- replace old drives with new of a bigger capacity
- remove drives
- indefinitely expand storage capacity by adding new drives of any size
- move a disk drive to another node
- add new nodes to the cluster
- no vendor lock-in to a particular disk drive adapter or motherboard
- effortlessly assign spinning drives for the nearline storage, and solid state drives for model training
- flexibility of having different types of pools with different redundancy levels using the same disk drives
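To illustrate the drive-replacement point above: a failed-OSD swap on a Proxmox/CEPH cluster typically looks like the sketch below. The OSD id `12` and the device `/dev/sdX` are hypothetical placeholders; adjust them to your cluster.

```
# mark the failed OSD out so its data is rebuilt on the remaining drives
ceph osd out osd.12
# watch recovery; wait for the cluster to report HEALTH_OK again
ceph -s
# on the node that hosts the drive: stop the daemon and remove the OSD
systemctl stop ceph-osd@12.service
pveceph osd destroy 12 --cleanup 1
# after installing the replacement drive, create a new OSD on it
pveceph osd create /dev/sdX
```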
Prerequisites
- A running Proxmox VE 8.2 cluster with at least 3 nodes
- ceph 18.2.4 reef installed on all nodes and initialized, one monitor and manager per node, ceph dashboard installed on all nodes and enabled
- At least 6 disk drives per pool for 4+2 erasure coding
- HBAs flashed to IT mode
- spinning rust: same size SATA/SAS CMR enterprise-grade drives
- solid state: same size U.2 NVMe enterprise-grade SSDs, or at the very least consumer-grade SATA drives with PLP. The latter is extremely important: drives without PLP (read: most consumer-grade drives) might power-reset themselves under heavy load, causing OSDs to fail
- Ubuntu or Debian VMs and LXCs that will use the CEPH FS.
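Before creating any pools, it may be worth sanity-checking that the cluster sees all nodes, OSDs and device classes as expected, for example:

```
ceph -s                    # overall health, mon/mgr/osd counts
ceph osd tree              # OSDs grouped by host
ceph osd crush class ls    # should list both "hdd" and "ssd"
ceph osd df tree           # per-OSD sizes and utilization
```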
Create pools
- create MDSs on all nodes and give them unique names: log in to each node and create two MDSs per node with `pveceph mds create --name first`, using a different unique name for each MDS
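For example, on a three-node cluster this could look like the sketch below; the MDS names beyond `first` are arbitrary placeholders, only uniqueness matters.

```
# node 1
pveceph mds create --name first
pveceph mds create --name second
# node 2
pveceph mds create --name third
pveceph mds create --name fourth
# node 3
pveceph mds create --name fifth
pveceph mds create --name sixth
# verify: all MDSs should be listed (standby until the filesystems below exist)
ceph mds stat
```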
- change the `.mgr` pool's `replicated_rule` to `replicated_ssd` to both change the failure domain to OSD and move the pool to SSD. With the failure domain set to host, the rule would not be able to resolve itself if fewer than three nodes in the cluster have OSDs. Moreover, this default rule has no device class set; since we must set a device class for all other pools later, keeping it would stop PG autoscaling from working on all pools, and the ceph MDSs would get stuck on creation.

```
ceph osd crush rule create-replicated replicated_ssd default osd ssd
ceph osd crush rule dump replicated_ssd
ceph osd pool set .mgr crush_rule replicated_ssd
ceph osd crush rule rm replicated_rule
```
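To confirm the rule swap took effect (a quick optional check):

```
ceph osd crush rule ls              # replicated_ssd listed, replicated_rule gone
ceph osd pool get .mgr crush_rule   # should print: crush_rule: replicated_ssd
ceph -s                             # .mgr PGs should return to active+clean
```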
- create the SSD pool:

```
pveceph pool create erasure42ssd --erasure-coding k=4,m=2,failure-domain=osd,device-class=ssd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
ceph osd pool set erasure42ssd-data allow_ec_overwrites true
ceph osd pool set erasure42ssd-data bulk true
ceph osd pool application enable erasure42ssd-data cephfs --yes-i-really-mean-it
ceph osd pool application disable erasure42ssd-data rbd --yes-i-really-mean-it
ceph fs new cephfsssd erasure42ssd-metadata erasure42ssd-data --force
```
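pveceph creates an erasure-code profile behind the scenes; if you want to double-check that it matches the intended 4+2 layout, something like the following should work (the exact profile name may vary, so list the profiles first):

```
ceph osd erasure-code-profile ls                 # find the profile pveceph created for this pool
ceph osd erasure-code-profile get <profile>      # expect k=4, m=2, crush-failure-domain=osd, crush-device-class=ssd
ceph osd pool ls detail | grep erasure42ssd      # pool flags (ec_overwrites, bulk) and pg_num
```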
- create the HDD pool:

```
pveceph pool create erasure42hdd --erasure-coding k=4,m=2,failure-domain=osd,device-class=hdd --pg_autoscale_mode on --application cephfs --add_storage 0 --target-size-ratio 1
ceph osd pool set erasure42hdd-data allow_ec_overwrites true
ceph osd pool set erasure42hdd-data bulk true
ceph osd pool application enable erasure42hdd-data cephfs --yes-i-really-mean-it
ceph osd pool application disable erasure42hdd-data rbd --yes-i-really-mean-it
ceph fs new cephfshdd erasure42hdd-metadata erasure42hdd-data --force
```
- check with `ceph fs ls`, it should output:

```
name: cephfsssd, metadata pool: erasure42ssd-metadata, data pools: [erasure42ssd-data ]
name: cephfshdd, metadata pool: erasure42hdd-metadata, data pools: [erasure42hdd-data ]
```
- set all metadata pools to use the replicated_ssd CRUSH rule:

```
ceph osd pool set erasure42ssd-metadata crush_rule replicated_ssd
ceph osd pool set erasure42hdd-metadata crush_rule replicated_ssd
```
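A quick way to confirm the metadata pools moved and that the autoscaler is happy with all the pools:

```
ceph osd pool get erasure42ssd-metadata crush_rule   # expect: crush_rule: replicated_ssd
ceph osd pool get erasure42hdd-metadata crush_rule
ceph osd pool autoscale-status                       # every pool should have a rule and a PG target
```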
- change the CEPH FS maximum file size limit from the default 1 TB to 5 TB for each filesystem
- first check the current limit, it should be 1099511627776

```
ceph fs get cephfsssd | grep max_file_size
...
```

- then set the new limit

```
ceph fs set cephfsssd max_file_size 5497558138880
...
```
- (optional) add the pools as storage to the cluster using the GUI (Datacenter -> Storage), give them the names `cfssd` and `cfhdd` respectively
- create clients and keyrings for each cephfs, use paths for flexibility:
```
sudo ceph auth get-or-create client.userssd mon 'allow r' mds 'allow r path=/, allow rwps path=/userssd' osd 'allow rw pool=erasure42ssd-data' -o /etc/ceph/ceph.client.userssd.keyring
...
```
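You can verify the caps that were granted (and retrieve the key again later) with:

```
ceph auth get client.userssd   # shows the mon/mds/osd caps from the command above
ceph auth get client.userhdd
```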
- create directories for the clients:

```
mkdir /mnt/pve/cfssd/userssd
...
```
Mount CEPH FS in an LXC
LXCs must use CEPH FS through bind mounts on the PVE host node. Do the following on the PVE host node:
- create directories for the bind mounts for each filesystem:

```
mkdir -p /mnt/bindmounts/all_hdd
...
```
- edit the host's fstab with `nano /etc/fstab` and add a line for each mount point (there can be multiple mount points per pool):

```
192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd /mnt/bindmounts/all_ssd ceph mds_namespace=cephfsssd,name=userssd,secretfile=/etc/pve/priv/ceph.client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208 0 2
...
```

and so on for the remaining mount points
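It may be worth test-mounting an entry right away instead of waiting for a reboot, for example:

```
mount /mnt/bindmounts/all_ssd      # uses the options from /etc/fstab
df -h /mnt/bindmounts/all_ssd      # should report the cephfs capacity, not the root fs
```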
- if the bind mounts fail to mount after reboot, create a systemd unit that re-runs `mount -a`: `nano /etc/systemd/system/manualfstab.service`
- add the following lines:

```
[Unit]
Description=Mount host fstab manually
After=network.target

[Service]
Type=idle
User=root
ExecStart=/bin/sh -c "mount -a"
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
- enable and start the service:

```
systemctl enable manualfstab.service
systemctl start manualfstab.service
```
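After the next reboot, the service status and journal show whether the mounts came up:

```
systemctl status manualfstab.service
journalctl -u manualfstab.service -b   # mount errors from the current boot, if any
```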
- check the container config for existing mountpoints with `cat /etc/pve/lxc/100.conf | grep mp`, then add the bind mount using the next free mountpoint number:

```
pct set 100 -mp3 /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
```
- or edit the config file directly with `nano /etc/pve/lxc/100.conf`:

```
mp3: /mnt/bindmounts/all_hdd,mp=/mnt/all_hdd
```
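After restarting the container, a quick check from the host confirms the bind mount is visible inside it (CT 100 is the example ID used above):

```
pct exec 100 -- df -h /mnt/all_hdd
```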
Mount CEPH FS in a VM (or any other server)
- install the ceph client tools: `sudo apt install ceph-common`
- copy the keyrings from the PVE host to the VM:

```
scp root@192.168.100.206:/etc/ceph/ceph.client.userssd.keyring /etc/ceph/ceph.client.userssd.keyring
...
```
- copy the ceph config:

```
scp root@192.168.100.206:/etc/pve/ceph.conf /etc/ceph/ceph.conf
```
- edit it with `sudo nano /etc/ceph/ceph.conf`, remove all blocks except for [global], and add the following lines:

```
[client.userssd]
keyring = /etc/ceph/client.userssd.keyring

[client.userhdd]
keyring = /etc/ceph/client.userhdd.keyring
```
- edit the fstab with `sudo nano /etc/fstab` and add lines for each mount point:

```
192.168.100.206,192.168.100.207,192.168.100.208:/userssd/all_ssd /mnt/all_ssd ceph mds_namespace=cephfsssd,name=userssd,secretfile=/etc/ceph/client.userssd.keyring,noatime,nodiratime,noacl,_netdev,mon_addr=192.168.100.206/192.168.100.207/192.168.100.208 0 2
...
```
- naturally, create the mount point directories first (`mkdir -p /mnt/all_ssd`, etc.), then run `sudo mount -a`
- check with `mount` and `df -h`
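As a final sanity check you could write a test file and make sure it lands on the CEPH FS:

```
dd if=/dev/zero of=/mnt/all_ssd/.write_test bs=1M count=256 conv=fsync
ls -lh /mnt/all_ssd/.write_test
rm /mnt/all_ssd/.write_test
```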