Article ID: 118404, created on Nov 3, 2013, last review on Jan 8, 2016

  • Applies to:
  • Virtuozzo 6.0

Symptoms

A Chunk Server is marked as "failed":

~# pstorage -c $CLUSTER_NAME stat | grep failed
1031 failed        1TB  380GB    10816    15%      1/12   0.20 192.168.1.31

Run pstorage top and press 'i' several times until the "FLAGS" column is displayed. The flags should reveal the cause of the failure:

  CSID TIER    COST ERR  LAST_ERR         LAST_LINK_ERR    JRN_FULL FLAGS
  1520    0     815 None   6,  5h 59m ago  11,  2d  0h ago       0% JCS
  1521    0     883 None   6,   4 min ago   4,  1d 23h ago       0% JCcH

The following letters in the FLAGS column indicate why a CS failed:

  • "H" - HDD failed (returned I/O error)
  • "h" - HDD data checksum failed
  • "S" - SSD failed (returned I/O error)
  • "s" - SSD data checksum failed
  • "R" - broken repository. CS couldn’t find its repository
  • "T" - I/O request timeout
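
The flag letters listed above can be decoded mechanically. The following is a minimal sketch and not part of the pstorage tooling: the FAILURE_FLAGS table and the decode_flags helper are hypothetical names introduced for illustration. Letters that this article does not describe (such as "J" or "C" in the sample output) are ignored, since their meaning is not covered here.

```python
# Hypothetical helper: maps the failure letters documented in this article
# to their descriptions. It is not part of the pstorage tooling.
FAILURE_FLAGS = {
    "H": "HDD failed (returned I/O error)",
    "h": "HDD data checksum failed",
    "S": "SSD failed (returned I/O error)",
    "s": "SSD data checksum failed",
    "R": "broken repository",
    "T": "I/O request timeout",
}


def decode_flags(flags):
    """Return the failure reasons found in a FLAGS value, in order.

    Letters not listed in FAILURE_FLAGS are skipped.
    """
    return [FAILURE_FLAGS[ch] for ch in flags if ch in FAILURE_FLAGS]


if __name__ == "__main__":
    # FLAGS values copied from the sample "pstorage top" output above.
    for csid, flags in (("1520", "JCS"), ("1521", "JCcH")):
        print(csid, decode_flags(flags))
```

For the sample output, this reports an SSD I/O error for CS 1520 and an HDD I/O error for CS 1521.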

Cause

1. A Chunk Server will be marked as "failed" when hardware errors occur on the Chunk Server's HDD.

First example:

Nov  3 16:35:35 cs1031 kernel: [3776877.755229] sd 4:1:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  3 16:35:35 cs1031 kernel: [3776877.755235] sd 4:1:0:0: [sda] Sense Key : Aborted Command [current]
Nov  3 16:35:35 cs1031 kernel: [3776877.755243] sd 4:1:0:0: [sda] Add. Sense: Ack/nak timeout
Nov  3 16:35:35 cs1031 kernel: [3776877.755248] sd 4:1:0:0: [sda] CDB: Read(10): 28 00 29 b6 59 08 00 00 08 00

Second example:

~# dmesg | less
...
Jun  5 19:33:52 pcsnode kernel: [32954.094513] ata2.00: status: { ERR }
Jun  5 19:33:52 pcsnode kernel: [32954.094514] ata2.00: error: { ABRT }
Jun  5 19:33:52 pcsnode kernel: [32954.094516] ata2: hard resetting link
... 

Note: The errors may differ from those shown above; any serious hardware error will put the Chunk Server into the "failed" state.
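
When reviewing a saved kernel log, lines similar to the two examples can be picked out with a small filter. This is only a sketch: the pattern list and the find_disk_errors helper are hypothetical, and the patterns cover only the messages quoted in this article, so a clean result does not prove that the disk is healthy.

```python
import re

# Assumption: these patterns cover only the two examples quoted in this
# article; real hardware errors can look different.
DISK_ERROR_PATTERNS = [
    re.compile(r"Sense Key : "),
    re.compile(r"ata\d+\.\d+: status: \{ ERR \}"),
    re.compile(r"ata\d+\.\d+: error: \{ \w+ \}"),
    re.compile(r"ata\d+: hard resetting link"),
]


def find_disk_errors(lines):
    """Return the log lines that match any of the patterns above."""
    return [
        line for line in lines
        if any(pattern.search(line) for pattern in DISK_ERROR_PATTERNS)
    ]


if __name__ == "__main__":
    sample = [
        "Jun  5 19:33:52 pcsnode kernel: [32954.094513] ata2.00: status: { ERR }",
        "Jun  5 19:33:52 pcsnode kernel: [32954.094516] ata2: hard resetting link",
    ]
    for line in find_disk_errors(sample):
        print(line)
```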

Parallels Cloud Storage detects hardware errors on the Chunk Server and marks it as permanently failed to avoid any possible data loss.

The failure is recorded in the cluster event log:

~# pstorage -c $CLUSTER_NAME get-event | grep 'CS#1031'
03-11-13 16:35:35.563 MDS WRN: CS#1031 have reported hard error on 'root.hds'
03-11-13 16:35:35.576 MDS WRN: CS#1031 is failed permanently and will not be used for new chunks allocation
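
To list every Chunk Server that the MDS has marked as permanently failed, the saved event output can be scanned for the warning shown above. This is a minimal sketch: the failed_cs_ids helper is a hypothetical name, and it assumes the message wording matches the sample in this article exactly.

```python
import re

# Matches the MDS warning quoted above and captures the Chunk Server ID.
# Assumption: the wording is identical to the sample in this article.
FAILED_CS = re.compile(r"CS#(\d+) is failed permanently")


def failed_cs_ids(event_lines):
    """Return the sorted, de-duplicated IDs of permanently failed CSs."""
    ids = set()
    for line in event_lines:
        match = FAILED_CS.search(line)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    events = [
        "03-11-13 16:35:35.563 MDS WRN: CS#1031 have reported hard error on 'root.hds'",
        "03-11-13 16:35:35.576 MDS WRN: CS#1031 is failed permanently and will not be used for new chunks allocation",
    ]
    print(failed_cs_ids(events))
```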

2. A Chunk Server will be marked as "failed" when hardware errors occur on an SSD drive that is used for write journalling.

For a Chunk Server that uses an SSD drive for write journalling, losing the journal is a critical failure, because data integrity can no longer be verified.

To check the SSD, see article 122650, "[HOW TO] Verify if the SSD disk is healthy".

Resolution

  1. For disks used as Chunk Servers, follow "Replacing Disks Used as Chunk Servers".

  2. For an SSD disk used for write journalling, follow "Failed Write Journalling SSDs".

Related topics

122532 Chunk server failed suddenly, causing high load on the server

123227 SSD failure on PCS Storage node - how do I replace the disk?

118831 What to check in case of I/O performance degradation on Parallels Cloud Storage?
