Article ID: 117662, created on Oct 2, 2013, last review on Jun 17, 2016

Applies to:

  • Virtuozzo 6.0

Symptoms

Any cluster operation fails:

~# pstorage -c $CLUSTER_NAME top
08-08-13 17:40:19.647 Unable connect to cluster, timeout (30 sec) expired.

Cause

Too few MDS servers are active: the cluster is operable only while more than half of the registered MDS servers are online. The MDS servers may be crashing, in which case the log contains entries like the following:

~# tail -3 /var/log/pstorage/$CLUSTER_NAME/mds-1/fatal.log 
08-08-13 17:01:00.156 mdsd #8 reports hard error (134 / SIGABRT)
08-08-13 17:01:30.293 mdsd #8 reports hard error (134 / SIGABRT)
08-08-13 17:02:00.418 mdsd #8 reports hard error (134 / SIGABRT)
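
The majority rule stated above can be written as simple arithmetic. The following is a minimal sketch, not part of the product: the function names mds_quorum and mds_has_quorum are hypothetical helpers, and the numbers of registered and online MDS servers have to be supplied by the administrator.

```shell
# Hypothetical helper: minimum number of MDS servers that must be online
# for a cluster with $1 registered MDS servers (a strict majority).
mds_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: succeeds (exit status 0) when $2 online servers
# out of $1 registered servers form a strict majority.
mds_has_quorum() {
    [ "$2" -ge "$(mds_quorum "$1")" ]
}
```

For example, "mds_quorum 7" prints 4 and "mds_quorum 5" prints 3. Assuming the seven paxos nodes listed in the first trace in this article are all of the registered MDS servers, that cluster would stop responding once fewer than four of them remain online.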

Check the log file /var/log/pstorage/$CLUSTER_NAME/mds-1/mds.log.gz for more information. It may contain call traces that help to identify the root cause:

  1. Known issue PSBM-21516, fixed in PCS 6.0 Update 4:

    08-08-13 17:51:25.513 rjournal: replaying snapshot /pstorage/pcs-storage-cluster-mds/ploop_mds.img.mnt/data/journal.313.sn
    08-08-13 17:51:25.521 Global: 08-08-13 17:51:25 MDS Default lease timeout is set to 60000 msec
    08-08-13 17:51:25.521 added paxos node 1, addr 91.184.28.142:2510
    08-08-13 17:51:25.521 paxos node 1 became active
    08-08-13 17:51:25.521 added paxos node 2, addr 91.184.28.143:2510
    08-08-13 17:51:25.521 paxos node 2 became active
    08-08-13 17:51:25.522 added paxos node 3, addr 91.184.28.144:2510
    08-08-13 17:51:25.522 paxos node 3 became active
    08-08-13 17:51:25.522 added paxos node 5, addr 91.184.28.146:2510
    08-08-13 17:51:25.522 paxos node 5 became active
    08-08-13 17:51:25.525 added paxos node 6, addr 91.184.28.141:2510
    08-08-13 17:51:25.525 paxos node 6 became active
    08-08-13 17:51:25.525 added paxos node 7, addr 91.184.28.147:2510
    08-08-13 17:51:25.525 paxos node 7 became active
    08-08-13 17:51:25.527 added paxos node 8, addr 91.184.28.160:2510
    08-08-13 17:51:25.527 paxos node 8 became active
    08-08-13 17:51:25.550 VERSION: 9 from node #2 ver 9
    08-08-13 17:51:25.566 BUG at cs_wd.c:1656/wd_set_adm_status()
    08-08-13 17:51:25.566 pstorage version: 6.0.3-42 (Debug)
    08-08-13 17:51:25.566 ---------- [11 stack frames] ----------
    08-08-13 17:51:25.566 /usr/lib64/libpcs_io.so(+0x1c137) [0x7f0933323137]
    08-08-13 17:51:25.566 /usr/lib64/libpcs_io.so(show_trace+0xb5) [0x7f0933323275]
    08-08-13 17:51:25.566 /usr/lib64/libpcs_io.so(pcs_err+0x35) [0x7f0933322835]
    08-08-13 17:51:25.566 /usr/lib64/libpcs_io.so(+0x1b858) [0x7f0933322858]
    08-08-13 17:51:25.566 /usr/bin/mdsd() [0x460bba]
    08-08-13 17:51:25.566 /usr/bin/mdsd() [0x456995]
    08-08-13 17:51:25.566 /usr/bin/mdsd(journal_replay+0x4b6) [0x468966]
    08-08-13 17:51:25.566 /usr/bin/mdsd(rjournal_replay+0x1a7) [0x46b977]
    08-08-13 17:51:25.566 /usr/bin/mdsd(main+0x590) [0x434080]
    08-08-13 17:51:25.566 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f0932d8dcdd]
    08-08-13 17:51:25.566 /usr/bin/mdsd() [0x433829]
    
  2. Known issue PSBM-20561, fixed in PCS 6.0 Update 3:

    12-06-13 05:02:18.862 neigh_check_accepted: nid=5 wants connection
    12-06-13 05:02:18.862 neigh_do_auth: send prefered auth_name = digest
    12-06-13 05:02:18.862 neigh_do_auth: nid=5 tells its id
    12-06-13 05:02:18.862 neigh_do_sec_auth
    12-06-13 05:02:18.863 neigh_do_auth: nid=5 tells its id
    12-06-13 05:02:18.863 neigh_do_sec_auth
    12-06-13 05:02:18.863 neigh_do_auth: nid=5 tells its id
    12-06-13 05:02:18.863 neigh_set_connected: neigh auth passed (id=5, ver=1, srv ver=9)
    12-06-13 05:02:18.863 check_version: current ver 9/9
    12-06-13 05:02:19.631 neigh: 0x5517fa0 connected err=111
    12-06-13 05:02:20.630     wd_replay_cs_drop: CS#1035
    12-06-13 05:02:20.631     wd_set_adm_status: CS#1035 failed releasing -> dropped
    12-06-13 05:02:20.631     release_chunk_node: #1f1[0x4000000][#1035]
    12-06-13 05:02:20.631     cl_dequeue_trash: #1f1[0x4000000][#1035]
    12-06-13 05:02:20.631 BUG at cs_wd.c:274/check_chunk_status()
    12-06-13 05:02:20.631 pstorage version: 6.0.3-37 (Debug)
    12-06-13 05:02:20.631 ---------- [21 stack frames] ----------
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be1c225]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so(show_trace+0xa5) [0x303be1c355]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so(pcs_err+0x3b) [0x303be1b90b]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be1b938]
    12-06-13 05:02:20.631 /usr/bin/mdsd(check_chunk_status+0xe3) [0x45c103]
    12-06-13 05:02:20.631 /usr/bin/mdsd(wd_set_adm_status+0x161) [0x460a81]
    12-06-13 05:02:20.631 /usr/bin/mdsd(cl_set_cs_status+0xab) [0x4574eb]
    12-06-13 05:02:20.631 /usr/bin/mdsd(wd_replay_cs_drop+0x120) [0x45c490]
    12-06-13 05:02:20.631 /usr/bin/mdsd() [0x46bf94]
    12-06-13 05:02:20.631 /usr/bin/mdsd(paxos_next_round+0x8c) [0x46efac]
    12-06-13 05:02:20.631 /usr/bin/mdsd() [0x471962]
    12-06-13 05:02:20.631 /usr/bin/mdsd() [0x471bd7]
    12-06-13 05:02:20.631 /usr/bin/mdsd(learner_rcv_learn+0xce) [0x471d7e]
    12-06-13 05:02:20.631 /usr/bin/mdsd() [0x46f38f]
    12-06-13 05:02:20.631 /usr/bin/mdsd() [0x472ae2]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be2011d]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be0c8e6]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be0b1f9]
    12-06-13 05:02:20.631 /usr/lib64/libpcs_io.so() [0x303be0b56a]
    12-06-13 05:02:20.631 /lib64/libpthread.so.0() [0x309c407851]
    12-06-13 05:02:20.631 /lib64/libc.so.6(clone+0x6d) [0x309bce88ed]
    12-06-13 05:02:20.632     pcs_log_terminate
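
To compare a compressed MDS log with the two traces above, the "BUG at" lines can be extracted and matched against the function names they report. This is a rough sketch under stated assumptions: classify_mds_bug and scan_mds_log are hypothetical helper names, and matching on the function name alone is an assumption, since a different defect could hit the same assertion. Treat a match as a hint, not a diagnosis.

```shell
# Hypothetical helper: map a "BUG at" log line to one of the two known
# issues quoted in this article, based only on the reported function name.
classify_mds_bug() {
    case "$1" in
        *"wd_set_adm_status()"*)  echo "PSBM-21516" ;;
        *"check_chunk_status()"*) echo "PSBM-20561" ;;
        *)                        echo "unknown" ;;
    esac
}

# Hypothetical helper: decompress the gzip log given as $1, keep the
# "BUG at" lines, and prefix each one with its classification.
scan_mds_log() {
    gzip -cd -- "$1" | grep -F 'BUG at' | while IFS= read -r line; do
        printf '%s => %s\n' "$(classify_mds_bug "$line")" "$line"
    done
}
```

Usage: scan_mds_log /var/log/pstorage/$CLUSTER_NAME/mds-1/mds.log.gz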
    

Resolution

All known causes are fixed in product updates, so check for available updates and install them on all servers in the cluster.

Refer to KB article #116673 for instructions.

Search Words

Unable connect to cluster

mds crash

cluster fails
