How to perform maintenance on a hardware node that is part of Virtuozzo storage (Parallels Cloud Storage)?
Here are the frequent questions related to maintenance.
- How to install product updates on the nodes of the cluster?
- Is it necessary to remove roles from a server during maintenance?
- How to replace a failed disk with the CS role?
- An SSD has failed; how to replace the drive?
- Is it possible to completely redeploy the cluster?
- If network equipment needs maintenance, what should be done?
The best practice is to install updates on the nodes one by one, checking after each node that all services are running. For each node:
- The server holding the MDS master role should be updated after all other servers with the MDS role. For Virtuozzo 6.0.7 or earlier, the MDS master must be updated last.
Reason: MDS protocol compatibility is one-way (upward compatible). An older non-master MDS may fail to keep its journal consistent with a newer master.
- It is strongly recommended to install updates on only one server with the MDS/CS role at a time.
Reason: for MDS servers, this preserves quorum; for CS servers, this maintains the minimum replication level and avoids performance degradation.
- Once updates are installed and services are restarted, verify that the services are stable and that none of them is crashing.
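The ordering rule above can be sketched as a small helper that computes an update order leaving the MDS master for last. The node names and the ":master" suffix below are illustrative placeholders, not pstorage syntax:

```shell
# Sketch: derive a node update order that leaves the MDS master for last.
# Node names and the ":master" marker are illustrative placeholders.
nodes="node1 node2:master node3"

master=""
order=""
for n in $nodes; do
    case "$n" in
        *:master) master="${n%%:*}" ;;   # defer the MDS master
        *)        order="$order $n" ;;
    esac
done
order="$order $master"
echo "update order:$order"
```

With the sample list above this prints the master-holding node (node2) last.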
Before answering this question, it is necessary to clarify certain points of the configuration.
- Here we assume that the replication level is set to 3:2 (3 replicas normally, 2 replicas minimum).
- The replica storing policy (failure domains) is to store one replica per host with the CS role.
- There are 5 or more MDS servers in the cluster.
- No other node is disabled, powered off, or planned to be restarted in the near future.
Note: The points above describe minimum recommended cluster configuration.
For a short maintenance period (up to 2 hours), it is safe to leave the system as is. For longer periods, the approach depends on the role and the amount of data to replicate:
The MDS service holds a relatively small amount of data (typically less than 10 GB, including the journal), so recreating an MDS server takes little time.
The recommendation is to drop the MDS role and create it on another server to stay within the recommended configuration guidelines.
The CS service manages large amounts of data (a few terabytes is not an extraordinary situation), and replicating that amount of data can take a long time.
Removing a CS instance initiates replication of the data it stores. If no other CS server is to be restarted or powered off, one server with the CS role can be taken offline without removing the CSes it manages. Any additional node with the CS role should be turned off only after replication has fully completed.
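Moving the MDS role could look like the dry run below. The cluster name, MDS ID, target IP, and the make-mds options are assumptions; verify them against the pstorage help on your version before removing the `echo` and running the commands for real:

```shell
# Dry-run sketch of relocating an MDS role; all values are placeholders and
# the make-mds options are assumptions -- check your pstorage version first.
CLUSTER=CLUSTER_NAME
MDS_ID=3                      # hypothetical ID of the MDS on the node under maintenance
NEW_IP=10.0.0.5               # hypothetical node that will take over the MDS role
RUN=echo                      # remove the echo prefix to execute for real

$RUN pstorage -c "$CLUSTER" rm-mds "$MDS_ID"
$RUN pstorage -c "$CLUSTER" make-mds -a "$NEW_IP" -r "/pstorage/$CLUSTER-mds"
```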
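Waiting for replication to finish before taking another CS node offline can be automated by polling the cluster status. In this sketch, `cluster_stat` is a stand-in for `pstorage -c CLUSTER_NAME stat`, and the `replicat` pattern is an assumption about what the status output contains during replication:

```shell
# Sketch: block until the cluster reports no ongoing replication.
# cluster_stat is a stand-in; on a real node use: pstorage -c CLUSTER_NAME stat
cluster_stat() { echo "status: healthy"; }

while cluster_stat | grep -q 'replicat'; do
    sleep 60                   # re-check every minute
done
state="safe to proceed"
echo "$state"
```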
The steps are provided in the documentation, Replacing Disks Used as Chunk Servers.
The exception is the case when the disk is no longer recognized by the system and no data can be read from it. In this case:
Stop the CS service, passing its mount point as the argument:
~# service pstorage-csd stop /pstorage/CLUSTER_NAME-CSN
Remove the CS instance from the cluster:
~# pstorage -c CLUSTER_NAME -f rm-cs CS_ID
Check that no process is holding the mount point, and terminate any processes found:
~# fuser -auv /pstorage/CLUSTER_NAME-CSN
Unmount the file system, replace the disk, and continue with the remaining steps from the documentation.
~# umount /pstorage/CLUSTER_NAME-CSN
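The steps above can be combined into one dry-run script. CLUSTER_NAME and CSN are the same placeholders as in the commands above, the CS ID is hypothetical, and `echo` is used so nothing is executed until you remove it:

```shell
# Dry run of the CS disk replacement steps; placeholders as in the article.
CLUSTER=CLUSTER_NAME
MNT=/pstorage/$CLUSTER-CSN
CS_ID=1025                     # hypothetical CS ID
RUN=echo                       # remove the echo prefix to execute for real

$RUN service pstorage-csd stop "$MNT"
$RUN pstorage -c "$CLUSTER" -f rm-cs "$CS_ID"
$RUN fuser -auv "$MNT"         # list processes holding the mount point
$RUN umount "$MNT"
```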
It is recommended to check periodically whether an SSD is healthy.
An SSD can be used for different cluster roles (for example, write journaling or caching), so the instructions for SSD replacement vary depending on its usage.
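The periodic health check can be scripted with smartmontools, as in this sketch (smartctl is assumed to be installed, and /dev/sdb is a placeholder for the SSD device):

```shell
# SSD health check sketch; /dev/sdb is a placeholder device.
DEV=/dev/sdb
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    smartctl -H "$DEV"                           # overall health self-assessment
    smartctl -A "$DEV" | grep -Ei 'wear|realloc|media' || true
    result="checked $DEV"
else
    result="smartctl or $DEV not available"
fi
echo "$result"
```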
If the maintenance does not affect the network segment dedicated to storage operations, no additional action is needed.
For the storage network, if there is no redundancy (e.g. only one interface is dedicated to the storage network, or all network communication goes through a single switch), then all services on the Virtuozzo hosts should be stopped in the order defined in the documentation.
Note: all virtual environments and iSCSI targets will be stopped.
Stop the services that depend on the Virtuozzo storage mount, on all nodes:
~# for svc in pvapp pvaagent shaman parallels-server vz pstorage-iscsi; do service $svc stop; done
Stop Virtuozzo storage mount on all nodes:
~# service pstorage-fs stop
Check that no "pstorage://" mount remains at this point; terminate any processes still holding the mount point and unmount the storage.
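This check can be scripted by counting the remaining "pstorage://" entries in the mount table (the mount point in the cleanup branch is a placeholder, as elsewhere in this article):

```shell
# Count remaining pstorage mounts; the storage is fully unmounted at zero.
remaining=$(mount 2>/dev/null | grep -c 'pstorage://')
if [ "$remaining" -eq 0 ]; then
    echo "no pstorage mounts left"
else
    # kill the processes holding the mount point, then unmount (placeholder path)
    fuser -kv /pstorage/CLUSTER_NAME
    umount /pstorage/CLUSTER_NAME
fi
```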
Stop services related to metadata functionality on nodes with this role installed:
~# service pstorage-mdsd stop
Stop services related to chunk server functionality on nodes with this role installed:
~# service pstorage-csd stop
Perform the necessary operations on the network equipment (switch replacement, firmware upgrade, etc.).
Start metadata and chunk services back, on all nodes with these roles installed:
~# for svc in pstorage-mdsd pstorage-csd; do service $svc start; done
Check that the cluster works:
~# pstorage -c CLUSTER_NAME stat
Start other services on all nodes:
~# for svc in pstorage-fs pstorage-iscsi vz parallels-server shaman pvaagent pvapp; do service $svc start; done
At this point the cluster should be back online.