Discussion:
[Libstoragemgmt-devel] [RFC] The status relationship of system, pool, volume.
Gris Ge
2014-08-12 12:49:58 UTC
Permalink
Hi Guys,

While documenting the status of systems, pools, and volumes, I found we
don't have a clear definition of their relationship[1].

I would like to propose one:
A system might contain these physical hardware components:
1. Disks.
2. Front-end ports. # Target ports.
3. Back-end ports. # Used to connect disks.
4. Battery.
5. Fan.
6. Power supply.
7. Temperature sensor.
8. RAM.
9. Internal links. # Used to connect cabinets or controllers.
10. Controller.

A system also has many logical objects (pool, volume, access group, etc.).

To match user expectations, the system status should only reflect
hardware errors:

1. STATUS_ERROR, indicates a hardware error. Even if it does not impact
the user's I/O, we still treat it as an error:

1. Disk broken.
# Don't treat 'not connected' as an error.
2. Front-end port broken.
# Don't treat 'not connected' as an error.
3. Back-end port broken.
# Even if another one is still functional.
4. Battery drained or at end of life.
# When the battery is charging or not installed,
# set STATUS_DEGRADED.
5. Fan broken or disconnected.
6. Temperature exceeds the maximum limit, or no reading.
7. Loss of one power supply.
8. RAM broken.
# Even when another module is still functional.
9. Internal link down or disconnected.
10. Some controller not accessible or down.

2. STATUS_PREDICTIVE_FAILURE, indicates a piece of hardware is about to
fail:
1. Battery is near end of life.
2. Some disk is near end of life.
# For example, the SSD SMART E9 Media Wearout attribute.
3. Some other hardware is near end of life.

3. STATUS_DEGRADED, indicates the system is not running at full
capability:
1. If a battery is supported: battery not installed or still charging.
2. If there is a temperature sensor: temperature exceeds the warning
level.
3. If spare disks are supported: not all pools are protected by a
spare disk.

4. STATUS_OK, indicates all hardware is up and running.
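The proposed rollup can be sketched in Python. This is an illustrative
model only: the constant values, the severity ordering, and the
`system_status()` helper are assumptions for this sketch, not the actual
libstoragemgmt API.

```python
# Illustrative bit-flag constants; names mirror the proposal, values are
# arbitrary and NOT the real libstoragemgmt constants.
STATUS_OK = 1 << 0
STATUS_ERROR = 1 << 1
STATUS_PREDICTIVE_FAILURE = 1 << 2
STATUS_DEGRADED = 1 << 3

# Severity order used when summarizing: any ERROR wins, then
# PREDICTIVE_FAILURE, then DEGRADED; otherwise OK.
_SEVERITY = [STATUS_ERROR, STATUS_PREDICTIVE_FAILURE, STATUS_DEGRADED]


def system_status(component_statuses):
    """Summarize per-component hardware statuses into one system status."""
    combined = 0
    for status in component_statuses:
        combined |= status
    for status in _SEVERITY:
        if combined & status:
            return status
    return STATUS_OK


# Example: healthy fans plus one drained battery (ERROR) -> system ERROR.
print(system_status([STATUS_OK, STATUS_ERROR, STATUS_OK]))  # prints 2
```

This matches the "raise error instead of ignore error" rule above: a single
broken component is enough to pull the whole system status down.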

Pool status reflects RAID or storage-space status. If a problem would
go away by deleting a volume/FS, it should be recorded at the
volume/FS level, not the pool level.
Space full or near full is not considered an error; user applications
can simply check free_space.

Volume/FS status is overridden by pool status. It can also hold an
individual volume status, such as STATUS_ERROR when thin-provisioned
space is full.
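The two rules above (pool overrides volume, and space checks live in the
application) can be sketched as follows. All names and constant values
here are hypothetical, chosen for illustration; they are not the real
libstoragemgmt API.

```python
# Hypothetical status constants for illustration only.
STATUS_OK = 1
STATUS_ERROR = 2


def effective_volume_status(pool_status, volume_status):
    """A pool problem overrides the volume's own status; otherwise the
    volume keeps its individual status (e.g. thin-provisioned space full)."""
    if pool_status != STATUS_OK:
        return pool_status
    return volume_status


def pool_near_full(free_space, total_space, threshold=0.1):
    """Space exhaustion is not a status error; applications check
    free_space themselves (here: warn below 10% free by default)."""
    return (free_space / total_space) < threshold
```

For example, a healthy pool lets a thin-P-full volume report its own
STATUS_ERROR, while a broken pool overrides an otherwise-OK volume.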

So generally:
1. System status only reflects hardware status.
# Some administrators might not consider a battery error a system error,
# but some do. We'd better raise an error than ignore one.
2. System status might not impact pool or volume/FS status directly.
3. Pool status impacts volume/FS status.
4. Volume/FS status can also hold individual status.

How does this sound?

Thank you in advance.
Best regards.

[1] How a disk or pool failure is summarized into the system status.
--
Gris Ge
Gris Ge
2014-08-12 13:24:02 UTC
Permalink
Post by Gris Ge
Hi Guys,
So generally,
1. System status only reflects hardware status.
# Some administrators might not consider a battery error a system error,
# but some do. We'd better raise an error than ignore one.
2. System status might not impact pool or volume/FS status directly.
3. Pool status impacts volume/FS status.
4. Volume/FS status can also hold individual status.
On second thought, instead of planning with no implementation, I would
like to remove all unused statuses. We can add them back once a plugin
finds a real use.

Currently, the ontap plugin just sets the system status to OK, nstor
sets it to UNKNOWN, and the SMI-S values are based on non-standard
guesses.

Instead of shipping an unimplemented design, adding the needed statuses
later might be more reasonable.

Best regards.
--
Gris Ge
Tony Asleson
2014-08-12 14:56:38 UTC
Permalink
Post by Gris Ge
Post by Gris Ge
Hi Guys,
So generally,
1. System status only reflects hardware status.
# Some administrators might not consider a battery error a system error,
# but some do. We'd better raise an error than ignore one.
2. System status might not impact pool or volume/FS status directly.
3. Pool status impacts volume/FS status.
4. Volume/FS status can also hold individual status.
On second thought, instead of planning with no implementation, I would
like to remove all unused statuses. We can add them back once a plugin
finds a real use.
Currently, the ontap plugin just sets the system status to OK, nstor
sets it to UNKNOWN, and the SMI-S values are based on non-standard
guesses.
Instead of shipping an unimplemented design, adding the needed statuses
later might be more reasonable.
I agree that information which doesn't report the actual state of
things is not useful, and mostly harmful. My intention was that we
would fill these fields in with useful values, but unfortunately that
hasn't always been straightforward, or thus far even possible.

I'm going to do some more investigating today/tomorrow to see if we
can get some better values in these fields, at least for ontap &
targetd. If we can't, we should remove the status fields and add them
back later as you suggest.

Thanks,
Tony

Tony Asleson
2014-08-12 14:38:49 UTC
Permalink
Post by Gris Ge
Hi Guys,
While documenting the status of systems, pools, and volumes, I found we
don't have a clear definition of their relationship[1].
A system might contain these physical hardware components:
1. Disks.
2. Front-end ports. # Target ports.
3. Back-end ports. # Used to connect disks.
4. Battery.
5. Fan.
6. Power supply.
7. Temperature sensor.
8. RAM.
9. Internal links. # Used to connect cabinets or controllers.
10. Controller.
A system also has many logical objects (pool, volume, access group, etc.).
To match user expectations, the system status should only reflect
hardware errors:
1. STATUS_ERROR, indicates a hardware error. Even if it does not impact
the user's I/O, we still treat it as an error:
1. Disk broken.
# Don't treat 'not connected' as an error.
2. Front-end port broken.
# Don't treat 'not connected' as an error.
3. Back-end port broken.
# Even if another one is still functional.
4. Battery drained or at end of life.
# When the battery is charging or not installed,
# set STATUS_DEGRADED.
5. Fan broken or disconnected.
6. Temperature exceeds the maximum limit, or no reading.
7. Loss of one power supply.
8. RAM broken.
# Even when another module is still functional.
9. Internal link down or disconnected.
10. Some controller not accessible or down.
2. STATUS_PREDICTIVE_FAILURE, indicates a piece of hardware is about to
fail:
1. Battery is near end of life.
2. Some disk is near end of life.
# For example, the SSD SMART E9 Media Wearout attribute.
3. Some other hardware is near end of life.
3. STATUS_DEGRADED, indicates the system is not running at full
capability:
1. If a battery is supported: battery not installed or still charging.
2. If there is a temperature sensor: temperature exceeds the warning
level.
3. If spare disks are supported: not all pools are protected by a
spare disk.
4. STATUS_OK, indicates all hardware is up and running.
Pool status reflects RAID or storage-space status. If a problem would
go away by deleting a volume/FS, it should be recorded at the
volume/FS level, not the pool level.
Space full or near full is not considered an error; user applications
can simply check free_space.
Volume/FS status is overridden by pool status. It can also hold an
individual volume status, such as STATUS_ERROR when thin-provisioned
space is full.
So generally:
1. System status only reflects hardware status.
# Some administrators might not consider a battery error a system error,
# but some do. We'd better raise an error than ignore one.
2. System status might not impact pool or volume/FS status directly.
3. Pool status impacts volume/FS status.
4. Volume/FS status can also hold individual status.
How does this sound?
It sounds reasonable. My thought when putting the status fields in was
a tree-like, top-down status: if system status == OK, all other
components are good. The individual status indicators were to allow
the user to figure out which individual component(s) are in a degraded
or error state.
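That top-down invariant (system OK implies every component OK) can be
stated as a small check. The constant value and function name below are
hypothetical, for illustration only, not the real libstoragemgmt API.

```python
STATUS_OK = 1  # hypothetical constant for illustration


def rollup_is_consistent(system_status, component_statuses):
    """Top-down invariant: a system reporting OK implies that every
    individual component also reports OK."""
    if system_status == STATUS_OK:
        return all(s == STATUS_OK for s in component_statuses)
    # A non-OK system status carries no constraint here; the per-component
    # indicators tell the user which part is degraded or in error.
    return True
```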

Regards,
Tony