Article ID: 117122, created on Sep 5, 2013, last review on Jun 17, 2016

  • Applies to:
  • Virtuozzo

Symptoms

Parallels Cloud Storage might become inoperable if the following occurs:

  1. The pstorage tool reports that it cannot connect to the PCS cluster from any node:

    ~# pstorage -c PCLUSTER stat
    25-08-13 02:54:15.427     Unable connect to cluster, timeout (30 sec) expired.
    
  2. One or more servers with the MDS role have run out of free disk space on the root partition:

    ~# df -h /
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda2              30G   30G     0 100% /
    

The following error messages can be found at the end of the MDS server's log file:

  1. Inability to bind to a socket or to replay the journal when (re)starting the service (a port check for this case is sketched after these symptoms):

    06-06-13 20:06:41.099 Replaying journal ...
    06-06-13 20:06:41.108 rjournal: replaying snapshot /pstorage/PCLUSTER-mds/journal.1.sn
    06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default lease timeout is set to 60000 msec
    06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default replication level is set to 1..1/3 QoS=0 Gr=1
    06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Minimum replicas set to 1, which means that cluster can be unable to survive loss of a single machine!
    06-06-13 20:06:41.108 added paxos node 1, addr 10.13.96.62:2510
    06-06-13 20:06:41.108 paxos node 1 became active
    06-06-13 20:06:41.867 Fatal: can't set local address 10.13.96.62:2510, err 98 (Address already in use)
    06-06-13 20:06:41.867 ---------- [11 stack frames] ----------
    06-06-13 20:06:41.867 /usr/lib64/libpcs_io.so(+0x1c735) [0x7ff6dd4ff735]
    06-06-13 20:06:41.867 /usr/lib64/libpcs_io.so(show_trace+0xb5) [0x7ff6dd500445]
    06-06-13 20:06:41.867 /usr/lib64/libpcs_io.so(pcs_fatal+0xe3) [0x7ff6dd501603]
    06-06-13 20:06:41.867 /usr/bin/mdsd() [0x4719c8]
    06-06-13 20:06:41.867 /usr/bin/mdsd(paxos_nodes_add+0x257) [0x471db7]
    06-06-13 20:06:41.867 /usr/bin/mdsd() [0x471e49]
    06-06-13 20:06:41.867 /usr/bin/mdsd(journal_replay+0x56f) [0x46736f]
    06-06-13 20:06:41.867 /usr/bin/mdsd(rjournal_replay+0x1a7) [0x46a7d7]
    06-06-13 20:06:41.867 /usr/bin/mdsd(main+0x5a8) [0x433ed8]
    06-06-13 20:06:41.867 /lib64/libc.so.6(__libc_start_main+0xfd) [0x3c5e21ecdd]
    06-06-13 20:06:41.867 /usr/bin/mdsd() [0x433669]
    
  2. Lack of free disk space to create files when (re)starting the service:

    ~# tail /var/log/pstorage/PCLUSTER/mds-3/fatal.log 
    01-09-13 13:01:03.376 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:01:03.388 mdsd #7 reports hard error (134 / SIGABRT)
    01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:01:46.117 mdsd #7 reports hard error (134 / SIGABRT)
    01-09-13 13:02:28.849 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:02:28.849 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:02:28.861 mdsd #7 reports hard error (134 / SIGABRT)
    01-09-13 13:03:11.579 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
    01-09-13 13:03:11.579 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
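
In the first case, error 98 (Address already in use) indicates that the MDS port is still held, typically by a previous mdsd instance that has not fully exited. A quick way to verify this is to check which process owns the socket; this is only a sketch, assuming the default MDS port 2510 shown in the log above:

    ~# ss -tlnp | grep 2510
    ~# ps aux | grep [m]dsd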
    

Cause

The cluster has lost the quorum of MDS servers, so it cannot remain operational in this state: clients become suspended, with all read and write operations frozen. It is very important to make sure that a majority (more than half) of all registered MDS servers are running without issues.
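
For example, a cluster with 5 registered MDS servers needs at least 3 of them healthy to keep the quorum. The arithmetic and the service status check below are a minimal sketch; the status action is assumed to be supported by the standard init script used in the Resolution section:

    ~# N=5; echo $(( N / 2 + 1 ))   # minimum number of healthy MDS servers out of N registered
    3
    ~# service pstorage-mdsd status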

Resolution

  1. Locate and remove large files from the root partition (except log files); a du-based alternative is sketched after these steps.

    ~# find / -xdev -size +100M
    
  2. Once there is 5-10 GB of free disk space, restart the MDS service:

    ~# service pstorage-mdsd restart
    

To prevent the situation from recurring, ensure that the root partition (which contains the /var directory) always has enough free disk space and avoid filling it completely; a simple monitoring sketch is shown after the list below.

  1. Avoid storing backups on the root partition.
  2. Ensure that kernel crash dumps are saved to a different partition or even to a different host; see KB #10044.
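
A minimal free-space check can be scheduled with cron to catch the problem early. This is only a sketch: the 90% threshold, the script name, and the syslog tag are illustrative assumptions, not part of the product.

    ~# cat /etc/cron.hourly/check-root-space
    #!/bin/sh
    # Warn via syslog when the root filesystem is more than 90% full (example threshold)
    USAGE=$(df -P / | awk 'NR==2 {gsub("%", ""); print $5}')
    if [ "$USAGE" -ge 90 ]; then
        logger -t check-root-space "WARNING: root filesystem is ${USAGE}% full"
    fi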

Search Words

No space left on device

cluster unavailable when master node is down

Unable connect to cluster, timeout (30 sec) expired

replaying snapshot

master offline

mds failed

err 98 (Address already in use)

hard error (134 / SIGABRT)

